Introduction
For a concept I am working on, I needed some realistic sample people data, with fields such as name, address, email, phone and occupation. I know this data could easily be fabricated; I tried that, and got results like this:
Jules abdoullah Desai,
123 Farber Lakes Dr,
Monroeville, Texas
That address is not convincing, and that person may not exist. I wanted to get the sample data from a source that gives some distribution of people across locations (city, state, etc.) and real, valid addresses (or at least close enough). The following example looks more convincing to me:
Ibrahim Khan
1 Lilly Corporate Ctr
Indianapolis, IN
46285-0001
United States
This blog post details the data source that provided me with this type of data and how I built a program to extract the information from it. By the end of this post, you will have some idea of:
- the data source
- the method used to get the data: web scraping
- the program and libraries that helped in extracting the information.
Data source
I searched for a data source (web site) that could give me such a dataset. The first stop was www.whitepages.com, where you can get hold of a list of people by searching on a last name. The search produced results like the one below:
As you can see, the distribution of people by locality is there, but there are too many duplicates and the addresses are not complete, so I was not satisfied with this.
On further searching and digging I finally came to www.jigsaw.com. Jigsaw's original intent is to host a repository of business lead contacts and provide a medium where people can exchange this information. My intent was different, but it turned out to be ideal. To start off, a simple search for the name "Ibrahim Khan" returns the following result:
From the search result I am able to get the company, title and name. When I click on the name itself, I get a few more details, as below:
Though the actual phone number and email address are hidden (you need to pay for them), other information like name, job title, organization and address looks realistic and valid. Hence this is a perfect data source for the data I need. I can live with auto-generating the email address (like firstname.lastname@somerandom.org) and phone number for these contacts.
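As a rough sketch of what that auto-generation could look like (the domain and the phone-number format are my own placeholders, not anything Jigsaw provides):

```python
import random

def placeholder_contact_details(first_name, last_name):
    """Build a made-up email address and US-style phone number for a contact."""
    email = "%s.%s@somerandom.org" % (first_name.lower(), last_name.lower())
    phone = "(%03d) %03d-%04d" % (random.randint(200, 999),
                                  random.randint(200, 999),
                                  random.randint(0, 9999))
    return email, phone

print(placeholder_contact_details("Ibrahim", "Khan"))
# e.g. ('ibrahim.khan@somerandom.org', '(482) 337-0915')
```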
NOTE: Jigsaw had a REST-based service to retrieve contact information, but at the time of this experiment it was closed and I was not able to get an account to access it.
Strategy for data collection
Ideally, we would have a list of names in a database and could query for each name. Unfortunately, Jigsaw does not provide this. A workaround is to search on a last name (e.g. Carpenter) and, from the list of names (each of which has a URL) in the search result, iterate over each one and retrieve the necessary information (name, title, organization, address, etc.). The only downside is that you have to come up with a list of common last names to use for the searches.
You can get the list of common last names from the US Census, such as the 2000 census Genealogy Data: Frequently Occurring Surnames. The amount of data collected from each last-name search is good enough, at least for my case.
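As a quick sketch, assuming the census surname list has been downloaded as a CSV with the surname in the first column and a header row (the actual file layout may differ slightly), the most common last names could be loaded like this:

```python
import csv

def load_surnames(path, limit=100):
    """Read the first `limit` surnames from a census surname CSV export."""
    surnames = []
    with open(path) as csv_file:
        reader = csv.reader(csv_file)
        next(reader)                          # skip the header row
        for row in reader:
            surnames.append(row[0].capitalize())
            if len(surnames) >= limit:
                break
    return surnames

last_names = load_surnames("census_surnames.csv")   # e.g. ['Smith', 'Johnson', ...]
```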
For the data collection process, I am going to do two passes over Jigsaw.com, driven by the list of last names:
- First pass: for each last name, do a search and store the resulting link for each contact.
- Second pass: for each URL (obtained above), do an HTTP request and extract the required set of data.
Tool
Having found the ideal data source, the next step is to find a way to extract the data from it. A well-known, popular approach is to use a web crawler/site scraper. Since the time I first learned about this approach, it has come a long way. Now you can easily create site scrapers with scraping software (e.g. Automation Anywhere) or employ a SaaS (e.g. Grepsr or Mozenda) that provides sophisticated extraction, scheduled scraping at intervals, and so on.
I could simply have used the options mentioned above (it would have saved a lot of time), but I wanted to get my hands dirty a little; I needed to understand what is involved in scraping and how it can be achieved. Hence I chose to do a little programming. I did not want to start from scratch, so I wanted to use existing, well-known libraries that provided:
- website crawling
- a framework for crawling
- discovering links and crawling further
- the ability to program in a language like Python, Java, etc.
- handling of websites with HTTPS sessions.
There are many frameworks and libraries that can achieve the above; for my exploration I chose Scrapy, as it had more references in forums (like StackOverflow) and in Google search results. I also liked that Scrapy can store its output as CSV, JSON or XML, all based on command-line arguments.
In order to scrape a web site, you have to understand the page's HTML source: inspect it and get to know the locations/XPaths of the elements you need. I recommend using the developer tools of Firefox or Chrome for this. I will not be explaining that process here.
Scrapy and dynamic rendered web pages
Before jumping right into the code, I have to explain how to handle Javascript-rendered web pages. I am covering this first because I learned the hard way how these pages get rendered and how to use Scrapy on such dynamic content.
The reason this section exists is that the search results and many other pages on Jigsaw are dynamically rendered by Javascript. Unless the web page is rendered, you will not get the list of contact names, URLs, etc.
Crawling with Scrapy is simple as long as the web pages are plain, pre-rendered HTML. However, if the page content, sections or links you are interested in are rendered by Javascript, it becomes difficult. The hard way is for your scraper to look at the whole HTML and script content and work out how the data could be extracted. Another way is to use a web-page Javascript rendering component/library that first reads the HTML and produces rendered output; again, this is hard.
Upon exploration I came across a workaround: Selenium. The Selenium WebDriver interacts with a browser (of your choice) and loads the URL. Once the page has been rendered by the browser, we can use classes in the WebDriver to get the rendered web page. The following uses the Selenium WebDriver component; the logic for loading a URL and extracting data is:
- Open the URL using the Selenium WebDriver.
- Selenium opens a browser instance (Firefox in my case) and loads the URL.
- Once the URL is loaded, wait a configurable amount of time for the browser to render the page.
- Once rendered, use classes from the Selenium WebDriver to look for the content (sections, links, etc.) of interest and proceed further, as sketched below.
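A minimal, standalone sketch of that flow, assuming Firefox is installed and using a placeholder URL and XPath (the real ones come from inspecting the Jigsaw pages):

```python
import time

from scrapy.selector import Selector
from selenium import webdriver

driver = webdriver.Firefox()                        # Selenium opens a Firefox instance
driver.get("https://www.jigsaw.com/searchContact")  # placeholder URL for the search page
time.sleep(5)                                       # wait for the Javascript to render the page

# Hand the rendered HTML over to Scrapy's selector for XPath-based extraction.
rendered_html = driver.page_source
contact_links = Selector(text=rendered_html).xpath(
    "//a[contains(@href, '/contact/')]/@href").extract()  # placeholder XPath

driver.quit()
```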
Code
The following sections delve into how I used Scrapy to crawl/scrape and extract the data. Scrapy scrapers are developed in Python, so I am hoping you know Python. As of today I am not an expert in Python, but I know enough to develop simple solutions.
The code was developed in a Windows environment. If you want to use this code, I leave it to you to research how to install Python on Windows, along with libraries like libxml, Scrapy, Selenium, etc.
The code is checked into the GitHub project Jigsaw Scrapper.git (will be updating this shortly).
Extracted data
As mentioned in the strategy, the code does two passes/runs over Jigsaw in order to extract the data of interest. In the first pass, we extract the following:
- Contact name
- URL for this contact
In the second pass, for each contact URL, we extract:
- Contact name (first, middle & last)
- Contact Id
- Title
- Company or organization
- Address
- Phone (if present)
- Added by
- Last updated
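In Scrapy these fields are typically declared as Item classes. The class and field names below are only illustrative (not necessarily the names used in the repository), but they show the shape of the two passes:

```python
from scrapy.item import Item, Field

class ContactURLItem(Item):
    """First pass: one entry per contact found in a last-name search."""
    name = Field()
    url = Field()

class ContactDetailItem(Item):
    """Second pass: full details extracted from a contact's page."""
    contact_id = Field()
    first_name = Field()
    middle_name = Field()
    last_name = Field()
    title = Field()
    company = Field()
    address = Field()
    phone = Field()
    added_by = Field()
    last_updated = Field()
```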
The output format is set on the Scrapy command line. For the first pass the data was stored in CSV format, and for the second pass I stored the output in JSON format.
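For example (the spider names below are illustrative; they are whatever the two spider classes declare in their name attribute, and newer Scrapy versions infer the format from the file extension rather than the -t flag):

```
scrapy crawl search_contacts -o contact_urls.csv -t csv
scrapy crawl contact_details -o contacts.json -t json
```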
Code logic
First pass : Searching and capturing the contact urls
As mentioned earlier, in the first pass the scraper does a search on a last name, iterates over the result set and stores the contact name and contact URL. This is done by the class in "SearchAndRetrieveContactsURL.py", which extends Scrapy's CrawlSpider class, meant for crawling.
The list of last names is initialized in an array (though it could also be read from a file). start_urls is left empty, as we are going to override the default loading procedure.
We override the method "start_requests" to handle the logon process. We also instantiate the Selenium WebDriver and load the logon URLs, which ensures the Selenium-instantiated browser is loaded and logged in.
If the login is successful, the "after_login" method is called. It checks whether the login was indeed a success and redirects to the contact search page.
Once on the search page, a search is done using a last name (initialized at the class level), and once we get the search result we parse it. The search URL is built by the method "getLastNameSearchURL".
The "parse_searchResult" then iterates over the result and captures the contact name and URL for the contact and if there is next_page link; then it crawls to the next page in a recursive fashion. The parsing of the search result page is done from the webpage that gets rendered in the browser (that was instantiated by Selenium).
On finishing one last name, the next last name is picked up, and the search-and-fetch process continues until all the names have been processed.
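A simplified sketch of what such a spider looks like. This is not the actual code from the repository: the URLs, form fields, XPaths and credentials are placeholders, error handling is omitted, and the real spider works through the last names one at a time rather than all at once.

```python
import time

from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver


class SearchAndRetrieveContactsURL(CrawlSpider):
    name = "search_contacts"
    start_urls = []                          # empty: start_requests drives the loading
    last_names = ["Carpenter", "Khan"]       # could also be read from a file

    def start_requests(self):
        # Bring up the Selenium-controlled browser, so the search pages can later
        # be rendered by a real, logged-in browser (placeholder login URL).
        self.driver = webdriver.Firefox()
        self.driver.get("https://www.jigsaw.com/login")
        # Log Scrapy itself in as well, so that after_login gets called.
        yield FormRequest(
            "https://www.jigsaw.com/login",                      # placeholder
            formdata={"username": "me@example.org", "password": "secret"},
            callback=self.after_login)

    def after_login(self, response):
        if "authentication failed" in response.text:             # placeholder check
            self.logger.error("Login failed")
            return
        for last_name in self.last_names:
            yield Request(self.getLastNameSearchURL(last_name),
                          callback=self.parse_searchResult)

    def getLastNameSearchURL(self, last_name):
        return "https://www.jigsaw.com/searchContact?lastName=%s" % last_name  # placeholder

    def parse_searchResult(self, response):
        # The result list is Javascript-rendered, so read it from the Selenium
        # browser rather than from the raw Scrapy response body.
        self.driver.get(response.url)
        time.sleep(5)                                             # let the page render
        sel = Selector(text=self.driver.page_source)
        for link in sel.xpath("//a[contains(@href, '/contact/')]"):   # placeholder XPath
            yield {"name": link.xpath("text()").extract_first(),
                   "url": link.xpath("@href").extract_first()}
        next_page = sel.xpath("//a[@class='next']/@href").extract_first()  # placeholder
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse_searchResult)
```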
Second Pass: Fetching contact data
Given a contact URL, I found that it is not mandatory to log on to Jigsaw: a simple URL load gets you the page already rendered, so there is no need to worry about dynamic rendering.
This is done by the scraper class in "RetrieveContactDetails.py". The scraper loads the text file containing the contact URLs and, for each URL, does an HTTP GET, extracts the data, and moves on to the next one.
The class extends Scrapy's "BaseSpider" class, as crawling is not needed.
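A minimal sketch of that second-pass spider, again with a placeholder file name and XPaths rather than the repository's actual code (newer Scrapy versions call the base class Spider instead of BaseSpider):

```python
from scrapy.http import Request
from scrapy.spiders import Spider


class RetrieveContactDetails(Spider):
    name = "contact_details"

    def start_requests(self):
        # "contact_urls.csv" is the output of the first pass; this assumes the
        # contact URL is the last column of each row.
        with open("contact_urls.csv") as urls_file:
            for line in urls_file:
                url = line.strip().split(",")[-1]
                if url.startswith("http"):
                    yield Request(url, callback=self.parse_contact)

    def parse_contact(self, response):
        # The XPaths below are placeholders; the real ones come from inspecting
        # the contact page with the browser developer tools.
        yield {
            "name": response.xpath("//h1/text()").extract_first(),
            "title": response.xpath("//span[@class='title']/text()").extract_first(),
            "company": response.xpath("//span[@class='company']/text()").extract_first(),
            "address": " ".join(response.xpath("//div[@class='address']//text()").extract()),
        }
```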
Retrospective
As mentioned earlier, there are much simpler solutions for web site scraping using packaged software and SaaS providers. For a one-time setup I would get the software; if scraping needs to be done on a regular basis, I would recommend exploring a SaaS-based solution, as it relieves you of the cost of hardware, skill sets, etc. There are also various books on site scraping to get programmers started, such as Webbots, Spiders, and Screen Scrapers.
Before employing any of the above, it is better that you understand what you want to scrape, the copyright issues, the authentication mechanisms, etc.
Last words
Till next time, "Vanakkam"