Thursday, January 24, 2013

Scraping away at Jigsaw

Introduction

For a concept I am working on, I needed some realistic sample people data. The fields are name, address, email, phone, occupation, etc. I know this data could easily be fabricated; I tried that and got results like this:

Jules abdoullah Desai,
123 Farber Lakes Dr,
Monroeville, Texas

That address is not convincing, and that person may not exist. I wanted sample data from a source that gives some distribution of people across locations (city, state, etc.) and real, valid addresses (or at least close enough). The following example looks much more convincing to me:

Ibrahim Khan
1 Lilly Corporate Ctr
Indianapolis, IN 46285-0001
United States
This blog post details the data source that provided this type of data and how I built a program to extract information from it. By the end of this post, you will have some idea of:
  • the data source
  • the method used to get the data
  • web scraping
  • the program and libraries that helped in extracting the information

Data source

I searched for a data source (website) that could give me such a dataset. The first stop was www.whitepages.com, where you can get a list of people by searching on a last name. The search produced results like the one below:


  
As you can see, the distribution of people by locality is there, but there are too many duplicates, and the addresses are not complete. Hence I was not satisfied with this source.

After further searching and digging, I finally came to www.jigsaw.com. Jigsaw's original intent is to host a repository of business lead contacts and provide a medium through which people can exchange this information. My intention was different, but it turned out to be ideal. To start off, a simple search on the name "Ibrahim Khan" returns the following result:




From the search result I am able to get the company, title and name. When I click on the name itself, I get a little more detail, as below:


Though the actual phone number and email address are hidden (you need to pay for them), the other information (name, job title, organization, address) looks realistic and valid. Hence this is a perfect data source for the data I need. I can live with auto-generating the email address (like firstname.lastname@somerandom.org) and phone number for these contacts.

NOTE: Jigsaw had a REST-based service to retrieve contact information, but at the time of this experiment it was closed and I was not able to get an account to access it.

Strategy for data collection

Ideally, we would have a list of names in a database and could query for each name. Unfortunately, Jigsaw does not provide this. A workaround is to search on a last name (e.g. Carpenter) and then iterate over the list of names (each of which has a URL) in the search result, retrieving the necessary information (name, title, organization, address, etc.). The only downside is that you have to come up with a list of common last names to use for the search.

You can get a list of common last names from the US Census, such as the 2000 census Genealogy Data: Frequently Occurring Surnames. The amount of data collected from each last-name search is good enough, at least for my case.

For the data collection process, I am going to make two passes at Jigsaw.com, driven by the list of last names described above:
  1. First pass: for each last name, do a search and store the resulting link for each contact.
  2. Second pass: for each URL obtained from the first pass, do an HTTP request and extract the required set of data.
Also note that your program has to log into Jigsaw and obtain an HTTPS session in order to retrieve all the data from the search results. A rough sketch of this two-pass flow is shown below.
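Conceptually, the whole collection boils down to something like the following sketch. The two helper functions are hypothetical stand-ins for the Scrapy spiders described later in this post; only the file handoff between the passes is real.

```python
import csv

def search_jigsaw(last_name):
    """Hypothetical stand-in for pass 1: yields (contact_name, contact_url) pairs."""
    return []

def extract_contact_details(contact_url):
    """Hypothetical stand-in for pass 2: returns the contact's detail fields."""
    return {}

last_names = ["carpenter", "khan", "smith"]      # e.g. taken from the census surname list

# Pass 1: search each last name and record the contact name and URL.
with open("ContactsToURL.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for last_name in last_names:
        for contact_name, contact_url in search_jigsaw(last_name):
            writer.writerow([contact_name, contact_url])

# Pass 2: fetch each contact URL and extract the detailed fields.
contacts = []
with open("ContactsToURL.csv", newline="") as inp:
    for contact_name, contact_url in csv.reader(inp):
        contacts.append(extract_contact_details(contact_url))
```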

Tool

Having found the ideal data source, the next step was to find a way to extract the data from it. A well-known, popular approach is to use a web crawler/site scraper.

This methodology has come a long way since I first learned about it. Now you can easily create site scrapers with scraping software (e.g. Automation Anywhere) or employ a SaaS (e.g. Grepsr or Mozenda) that provides sophisticated extraction, scheduled scraping at intervals, and so on.

I could simply have used the above-mentioned options (it would have saved a lot of time), but I wanted to get my hands dirty a little and understand what is involved in scraping and how it can be achieved. Hence I chose to program a little. I did not want to start from scratch, so I looked for existing, well-known libraries that provided:
  • website crawling
  • a framework for crawling
  • the ability to discover links and crawl further
  • the ability to program in a language like Python, Java, etc.
  • handling of websites with HTTPS sessions.
There are many frameworks and libraries that can achieve the above; for my exploration I chose Scrapy, as it had more references in forums (like Stack Overflow) and Google search results. I also liked that Scrapy can store the output as CSV, JSON or XML, based on command-line arguments.

In order to scrape a website, you have to understand the page's HTML source: inspect it and get to know the element locations/XPaths, etc. I recommend using the developer tools of Firefox or Chrome for this. I will not be explaining that process here.

Scrapy and dynamically rendered web pages


Before jumping right into the code, I have to explain how to handle JavaScript-rendered web pages. I am covering this first because I learned the hard way how these pages get rendered and how to use Scrapy on such dynamic content.

The reason this section exists is that the search results and many other pages on Jigsaw are dynamically rendered by JavaScript. Unless the page is rendered, you will not get the list of contact names, URLs, etc.
Crawling with Scrapy is simple as long as the web pages are pure, pre-rendered HTML. However, if the page content, sections or links that you are interested in are rendered by JavaScript, it becomes difficult. The hard way is for your scraper to look at the whole raw HTML and work out how the data could be extracted. Another way is to use a JavaScript rendering component/library that first reads the HTML and produces rendered output; again, this is hard.
Upon exploration I came across a workaround for the scraping: Selenium. Selenium WebDriver interacts with a browser (of your choice) and loads the URL. Once the page has been rendered by the browser, we can use the WebDriver classes to get at the rendered page. The following uses the Selenium WebDriver component. The logic for loading a URL and extracting data is:
  1. Open the URL using the Selenium WebDriver.
  2. Selenium opens a browser instance (Firefox in my case) and loads the URL.
  3. Once the URL is loaded, wait a configurable amount of time for the browser to render the page.
  4. Once rendered, use the Selenium WebDriver classes to look for the content (sections, links, etc.) of interest and proceed further.
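To make the idea concrete, here is a minimal, standalone sketch of steps 1-4 using the Selenium Python bindings, assuming Firefox and a fixed wait time; the actual spider instantiates the driver once and reuses it across pages.

```python
import time
from selenium import webdriver

def fetch_rendered_html(url, wait_seconds=5):
    """Load a URL in a real browser and return the HTML after JavaScript has run."""
    driver = webdriver.Firefox()        # steps 1-2: Selenium opens a Firefox instance
    try:
        driver.get(url)                 #            and loads the URL
        time.sleep(wait_seconds)        # step 3: wait for the page to render
        return driver.page_source       # step 4: the rendered DOM, ready for extraction
    finally:
        driver.quit()
```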

Code

The following sections delve into how I used Scrapy to crawl/scrape and extract the data. A Scrapy scraper has to be developed in Python, so I am hoping that you know Python. As of today I am not an expert in Python, but I know enough to develop simple solutions.

The code was developed in a Windows-based environment. If you want to use it, I leave it to you to research how to install Python on Windows and how to install libraries like libxml, Scrapy, Selenium, etc.

The code is checked into the GitHub project Jigsaw Scrapper.git (I will be updating this shortly).

Extracted data

As mentioned in the strategy, the code makes two passes/runs at Jigsaw in order to extract the data of interest. In the first pass, we extract the following:
  • Contact Name
  • URL for this contact
These are stored in a text file ("ContactsToURL.csv"). In the second pass, we retrieve the details of each contact and store them in another text file. The contact details that are extracted are:
  • Contact Name (First, Middle & Last)
  • Contact Id
  • Title
  • Company or Organization
  • Address
  • Phone (if present)
  • Added by
  • Last updated.
The above data structure is defined in the "Items.py" file, as required by Scrapy.
The output format is set on the Scrapy command line. For the first pass the data was stored in CSV format, and for the second pass I stored the output in JSON format.
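As a rough idea of what such an Items.py might contain (the field names here simply mirror the lists above; the exact names in the repository may differ, and older Scrapy versions import Item and Field from scrapy.item):

```python
import scrapy

class ContactURLItem(scrapy.Item):
    # First pass: one row per contact found in a search result.
    contact_name = scrapy.Field()
    contact_url = scrapy.Field()

class ContactDetailItem(scrapy.Item):
    # Second pass: the full contact record.
    contact_id = scrapy.Field()
    first_name = scrapy.Field()
    middle_name = scrapy.Field()
    last_name = scrapy.Field()
    title = scrapy.Field()
    company = scrapy.Field()
    address = scrapy.Field()
    phone = scrapy.Field()
    added_by = scrapy.Field()
    last_updated = scrapy.Field()
```

The output format is then chosen when launching the spider, with something like `scrapy crawl <spider> -o ContactsToURL.csv` for the first pass and `-o contacts.json` for the second (older Scrapy versions also take a `-t` flag for the format).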

Code logic

First pass: searching and capturing the contact URLs

As mentioned earlier, in the first pass the scraper does a search on a last name, iterates over the result set, and stores the contact name and contact URL.

This is done by the class in "SearchAndRetrieveContactsURL.py". The class extends Scrapy's CrawlSpider class, which is meant for crawling.

The list of last names is initialized in an array (though it could also be read from a file). start_urls is empty, as we are going to override the default loading procedure.

We override the "start_requests" method to handle the logon process. We also instantiate the Selenium WebDriver and load the logon URL, which ensures the Selenium-instantiated browser is up and logged in.
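I will not reproduce the whole class here, but a trimmed-down sketch of this part might look like the following. The login URL, form field names and credentials are placeholders, and import paths differ between Scrapy versions; the real code is in the repository.

```python
import scrapy
from scrapy.spiders import CrawlSpider          # scrapy.contrib.spiders in older Scrapy
from selenium import webdriver

class SearchAndRetrieveContactsURL(CrawlSpider):
    name = "search_and_retrieve_contacts_url"
    start_urls = []                              # intentionally empty; we drive the loading
    last_names = ["carpenter", "khan", "smith"]  # could also be read from a file

    def start_requests(self):
        # Bring up a real browser so JavaScript-rendered pages can be read later.
        self.driver = webdriver.Firefox()
        self.driver.get("https://www.jigsaw.com/login")          # placeholder logon URL
        # Log Scrapy in as well, so its own requests carry an authenticated session.
        yield scrapy.FormRequest(
            "https://www.jigsaw.com/login",                      # placeholder
            formdata={"username": "me@example.com", "password": "secret"},  # placeholders
            callback=self.after_login,
        )

    def after_login(self, response):
        # Verify the login really worked, then move on to the contact search page
        # (described in the next paragraphs).
        ...
```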

If the login succeeds, the "after_login" method is called. It checks that the login was indeed a success and then redirects to the contact search page.

Once on the search page, a search is done using a last name (initialized at the class level), and once we get the search result we parse it. This is done by the method "getLastNameSearchURL".

The "parse_searchResult" then iterates over the result and captures the contact name and URL for the contact and if there is next_page link; then it crawls to the next page in a recursive fashion. The parsing of the search result page is done from the webpage that gets rendered in the browser (that was instantiated by Selenium).


On finishing one last name, the next last name is retrieved, and the searching and fetching process continues until all the names have been processed. A trimmed-down sketch of the parsing and pagination follows.
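The XPaths below are placeholders that would come from inspecting Jigsaw's markup, and the snippet assumes `import time`, `from selenium.webdriver.common.by import By` and the ContactURLItem from the Items.py sketch above.

```python
    # Continuing the SearchAndRetrieveContactsURL class sketched earlier.
    def parse_searchResult(self, response):
        # Load the page in the Selenium-driven browser and parse the rendered DOM,
        # not Scrapy's raw (unrendered) response body.
        self.driver.get(response.url)
        time.sleep(5)                                             # let JavaScript render
        for link in self.driver.find_elements(By.XPATH, "//a[@class='contact-name']"):  # placeholder XPath
            item = ContactURLItem()
            item["contact_name"] = link.text
            item["contact_url"] = link.get_attribute("href")
            yield item
        next_page = self.driver.find_elements(By.XPATH, "//a[@class='next-page']")      # placeholder XPath
        if next_page:                                             # recurse onto the next result page
            yield scrapy.Request(next_page[0].get_attribute("href"),
                                 callback=self.parse_searchResult)
```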

Second Pass: Fetching contact data

Given a contact URL, I found that it is not mandatory to log on to Jigsaw: a simple URL load gets you the page already rendered, so there is no need to worry about dynamic rendering.

This is done by the scraper class in "RetrieveContactDetails.py". The scraper loads the text file containing the contact URLs and, for each URL, does an HTTP GET, extracts the data, and moves on to the next one.

The class extends Scrapy's "BaseSpider" class, as crawling is not needed.
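A trimmed-down sketch of what this spider might look like; the XPaths are placeholders, ContactDetailItem is the item from the Items.py sketch above, and names differ a little between Scrapy versions (BaseSpider is simply scrapy.Spider nowadays, and `.get()` was `extract_first()` in older releases).

```python
import csv
import scrapy

class RetrieveContactDetails(scrapy.Spider):     # "BaseSpider" in 2013-era Scrapy
    name = "retrieve_contact_details"

    def start_requests(self):
        # Feed the spider from the first pass's output file.
        with open("ContactsToURL.csv", newline="") as f:
            for contact_name, contact_url in csv.reader(f):
                yield scrapy.Request(contact_url, callback=self.parse)

    def parse(self, response):
        item = ContactDetailItem()               # from the Items.py sketch above
        # Placeholder XPaths; the real ones come from inspecting a contact page.
        item["first_name"] = response.xpath("//h1/span[@class='first']/text()").get()
        item["title"] = response.xpath("//*[@class='title']/text()").get()
        item["company"] = response.xpath("//*[@class='company']/text()").get()
        item["address"] = response.xpath("//*[@class='address']/text()").get()
        yield item
```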

Retrospective

As mentioned earlier, there are much simpler solutions for website scraping using off-the-shelf software and SaaS providers. If it is a one-time setup, I would get the software; if scraping needs to be done on a regular basis, I would recommend exploring a SaaS-provider-based solution, as it relieves you of the cost of hardware, skill set, etc.

There are also various books on site scraping to get programmers started, such as Web bots, Spiders, and Screen Scrapers.

Before employing any of the above, make sure you understand what you want to scrape, any copyright issues, the authentication mechanism, etc.

Last words


Till next time, "Vanakkam"

Thursday, January 3, 2013

Integrating Google App Engine & QuickBooks

Introduction

Based on a need from a friend, I was asked to build a very simple website with the following requirements:
  • A user can quickly search for items (which he had defined in his QuickBooks accounting software).
  • The details of an item are shown, including the price and the quantity he has in the warehouse.
  • An automated process syncs with QuickBooks, say every 30 minutes.
  • Development and maintenance costs should be low.
  • Custom search criteria, not supported by the default QuickBooks query, can be added.
There are many options for this, for example:
  • E-commerce software that integrates with QuickBooks (this was the best option, and the approach we ultimately took).
  • An open-source CMS (e.g. WordPress) plus a commercial plugin that integrates with QuickBooks, hosted yourself.
  • An app from the Intuit Marketplace.
  • Custom in-house development and hosting.

Technologies

As an opportunity to experiment and learn, I took this scenario and implemented a solution based on the following technologies:
  •   Google App Engine (GAE).
  •   QuickBooks Web Connector.
  •   Google Datastore.
  •   Google Web Toolkit (GWT).

Pre-Requisite

I am going to assume that you know the above-mentioned technologies, or have at least heard of them. If not, please search for them; there are tons of tutorials and articles about them that I don't want to repeat.

Why QuickBooks Web Connector (QBWC)?

QuickBooks offers various integration options (e.g. the QB SDK, Intuit web apps, QuickBooks Web Connector). I chose QuickBooks Web Connector, partly because I am more proficient in Java than .NET.

QBWC is also ideal because it is the one that initiates the conversation with the hosted web service. This means we do not have to expose QuickBooks to the internet: no headaches of opening firewall ports, etc.

Since the communication between the implemented web service and QBWC uses HTTPS and an authenticated session, eavesdropping on the traffic is not a concern either.

Why GAE ?

For the website, I chose to develop and host on Google Cloud as a Google App Engine (GAE) app. Some of the reasons I chose GAE are:
  • If the amount of data usage/page traffic is low (within the daily quota), you can practically run it for free.
  • GAE manages instances based on incoming requests, and they are load-balanced too; instances are spawned and managed automatically, with fewer administrative headaches.
  • The administrative dashboard GUI offers monitoring and controls across the various instances.


I am sure other PaaS/SaaS providers like Amazon and Salesforce offer something similar; for now I am sticking with GAE and not comparing the differences between them.


For the data store I chose Google Datastore over MySQL, just so I could experiment with a NoSQL-based datastore. I am not going to argue whether MySQL (in Google Cloud) would be the better option.

Flow

  • QBWC runs in the local infrastructure as a background service and communicates with QuickBooks. At regular intervals it invokes the identified/hosted web service (the QB sync service) and handles the various requests.
  • Data is stored in the Google Datastore.
  • Users interact via the various servlets/GWT modules exposed as the front-end GUI.

Source code

In the sections below I give a brief overview of how each of the identified components was implemented. For your reference, the code is hosted on GitHub: GAEqB
The code is just proof-of-concept (POC) code, to demonstrate that the solution works.

QB Sync service

In order to interact with QuickBooks via QBWC, we have to implement the web service defined by the QBWC WSDL. QBWC initiates the communication with the web service.

According to the QBWC guide, the sample Java code was developed using Apache Axis. This might not work in GAE, hence we have to generate the service from scratch. To develop a web service that follows the above service contract and host it in GAE, we take a bottom-up approach: define the Java class (with the appropriate @WebMethod annotations) and then use Java's wsgen utility to generate the WSDL and JAXB classes.

To match the namespace, the service definition class has to be part of the package "com.intuit.developer". Refer to the implemented class "com.intuit.developer.QBSyncService"; all the necessary service methods are defined there. As mentioned above, I was able to use the wsgen utility to generate the WSDL and JAXB-related classes.

For the POC, the implementation class does the following:


  • Authenticates the user credentials coming from QBWC and sends a valid response to continue the conversation.
  • Sends a request to get the defined item list (50 rows at a time).
  • On receiving the item list, parses the response and stores the items as entities in the Google Datastore.
Unlike with CXF and Axis, defining the service does not mean that it is automatically exposed for clients to interact with directly. With GAE, we have to develop a pass-through servlet that takes the request and forwards it to the appropriate service method. The return value from the service method is passed back as the response. This means we also have to marshal and unmarshal the SOAP request/response ourselves.

In the source code, refer to the class "ca.effacious.professional.gaeqb.qbsync.QBSyncServlet" for the servlet that does the above.

QBWC's configuration files

In order for QBWC to communicate with the web service, we have to configure it. Configurations for QBWC are first defined in an XML-based configuration file, which is then added to the QBWC application list. The configuration file contains things such as the URL for the web service, the URL for support, the FileID, the OwnerID, etc. More details on this configuration can be found in the QBWC programmer's guide.
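A trimmed .qwc example along these lines gives a feel for the file; all values are placeholders, and the authoritative list of required elements is in the QBWC programmer's guide.

```xml
<?xml version="1.0"?>
<QBWCXML>
  <AppName>GAEqB</AppName>
  <AppID></AppID>
  <AppURL>https://your-app.appspot.com/qbsync</AppURL>        <!-- placeholder service URL -->
  <AppDescription>Syncs QuickBooks items to GAE</AppDescription>
  <AppSupport>https://your-app.appspot.com/support</AppSupport>
  <UserName>qbwcuser</UserName>
  <OwnerID>{57F3B9B6-86F1-4FCC-B1FF-967DE1813D20}</OwnerID>   <!-- any GUID -->
  <FileID>{90A44FB7-33D9-4815-AC85-AC86A7E7D1EB}</FileID>     <!-- any GUID -->
  <QBType>QBFS</QBType>
  <Scheduler>
    <RunEveryNMinutes>30</RunEveryNMinutes>
  </Scheduler>
</QBWCXML>
```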

Local QBWC

For a local/development QBWC talking to a local GAE server, there is no magic. The GAE servlet was not HTTPS-based either, so the setup is pretty plain. A sample definition is found in the file "data/GAEqB.qwc".

GAE Hosted service + QBWC

When the service is hosted in GAE, the above-mentioned non-HTTPS communication will not work. QBWC mandates that in a production environment the communication must be HTTPS-based; at the very least, QBWC checks for the presence of a security certificate.

GAE apps can be reached via HTTPS, but the certificate is a generic one that is the same across all hosted apps. For a proper implementation, the app URL would have to be set up as part of a domain, and you would have to get a certificate for that domain.

I don't have a domain, nor a certificate (a paid service). I found a workaround, though (not suggested for a real production service). First I got a temporary certificate from VeriSign (/GAEqB/war/GAECert.cert) and packaged it as part of the web application; once deployed, the certificate can be accessed via a URL. Then, in the QBWC configuration file (/GAEqB/data/GAE_GAEqB.qwc), I defined that URL as the "CertURL" parameter. This hack/workaround worked; QBWC does not seem to validate the credentials, etc. on it.

Google Data Store (GDS)

As mentioned earlier, based on the response the item information is parsed, and specific attributes (name, full name, etc.) are chosen and stored as "Item" entities in the GDS. Instead of leaving the item entities rootless, they all belong to a common ancestor, "ITEMParent". For the entity key, I used the listId.

Refer to the class "ca.effacious.professional.gaeqb.service.ItemSyncStore" for the implementation.

GWT

For the front end, I developed a simple GWT module/servlet. It has an item query text field into which a user enters the name of the item to search for. The servlet looks up the item by name (name is an indexed property) and, once found, populates the result table.

Retrospective

  • This POC proved that developing a GAE app that interacts with QuickBooks is possible.
  • Building an app like GAEqB is also a cheaper custom option to develop and host, as Google Cloud is inexpensive.
This POC produced base-level code (the QBWC WSDL service, the pass-through servlet), which could now be enhanced into a complete website. If you have such an idea, we could talk!


Last words



Till next time, "Vanakkam"