Experimental learning: Graph database (Neo4j) as a datastore for MDM

Background

I am currently part of a Professional Service team; we provide service related to architecting, Solutioning, development, enhancement and implementation of Master Data Management. We are specialized heavily in IBM Master Data Management.

As part of the various implementations that I have been engaged on; I have had fair amount of implementation exposure to using MDM to store Customer data.

If you have been exposed to MDM (regardless of any vendor [IBM, Oracle, Golden Source]/ domain [Customer, Product, Security]); you would realize that in MDM there is a lot of relations (directed and bi-directional), properties etc are stored between entities (Customer, Email, Phone etc..). All the MDM products, that I have been exposed too, uses an RDBMS (DB2, Oracle, MYSQL etc..) as a datastore. After playing around with Neo4J, and realizing the nature of graph database as a datastore; I wanted to explore how we could use Neo4j for an MDM solution.

As part of trial play; I want to data model a MDM (for customer data model) for a retail enterprise "CrossRoads Inc.". In the following section I would demonstrate

- A customer data

- Relation between various source system and customer data

- Relationship between customer and entities (Email, Phone etc..)

- Preferences that a customer setup

- Merged nodes

I am using only the Cypher queries capability to demonstrate the above.

NOTE: This is just a POC; as such I wanted to highlight only the basic scenarios. A typical implementation would involve complex scenarios, based on enterprise needs.

POC

The purpose is just to demo, hence I am not going to walk through the basics of explaining graphs or Neo4j or Cypher queries. There are really great demo, tutorials & documentation available at the NEO4J website

Scenario : Initial setup
We start of creating the basic node that represents the enterprise "CrossRoad Inc."

CREATE (root:ROOT{company : "CrossRoads Inc.", implementor : "y"})

RETURN root;

Next we declare the nodes that identifies the various source system; from which record will be fed into MDM. CrossRoads Inc. operates with retail stores (identified by source "STORES") and also has an online ecommerce website (identified by source "ONLINE"), via which customers can buy products

MATCH (root:ROOT)

CREATE UNIQUE

root-[:HAS_SYSTEM]- (s1 :SOURCE {name:'ONLINE'}),

root-[:HAS_SYSTEM]- (s2 :SOURCE {name:'STORES'})

RETURN s1,s2;

Scenario : Creating a customer

The source system STORES has a customer record; which has an id of "STO00100".

Typically there are lot of request to search for a customer based on a souce id; since relation and its property are not yet indexable (not talking about the legacy indexing of Neo4j); I have opted to treat the source id as an entity by itself.

Customer entity are stored as a Graph Node. Using the new "Label" feature of Neo4j; we can create the "Customer" node.

An unique enterprise customer id (property "pguid") is generated and stored as part of the customer node. All the demographics information (ex : gender, ethinicity etc..) can be stored as properties of the customer node.

MATCH (src:SOURCE)
WHERE src.name='STORES'
CREATE UNIQUE
src -[:HAS_PARTY {active: 'Y',last_update_date:'02-01-2013'}]-> (pid:SOURCE_PARTY_ID {source:'STORES',systemid:'STO001000', since_date:'01-01-2013', active: 'Y'}),
pid -[:IDENTIFIES_PARTY {current:'Y'}]-> (p1:PARTY {pguid:"P1376",gender: 'M'})
RETURN p1;

Node 3 => Source node ("STORES)
Node 4 => Source Id node
Node 5 => Customer (party) node

Since there will be a lot of searches; I have opted to create the following indexes

CREATE INDEX ON :SOURCE_PARTY_ID(systemid);

CREATE INDEX ON :PARTY(pguid);

Now we can issue queries like the following

To search for a party with id "STO001000"

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p:PARTY)
WHERE pid.systemid="STO001000" AND pid.source='STORES' AND r.current='Y'
RETURN p;

To search for a party by its "guid"

MATCH (p:PARTY)
WHERE p.pguid="P1376"

RETURN p;

Scenario : Creating history on sub nodes (name)

In the next steps; Lets look at how sub entities (ex names) can be created and how historical information can be created.

When you walk into a store and if the cashier asks your name, most of the customers give partial name "J Doe" or full name "John Doe" or even fake / avatar names "Mr. Jesse Pinkmen" (Yeah Breaking bad fan here).

Retailer source system stores these name variations for a single customer record. If the customer changes the name (by calling up a CSR or something); they maintain history as to when the record was changed. new name etc.

Hence name is not stored as a property of the "Customer" entity; but rather a seperate sub-entity. The "name" node cannot exist without a customer and hence has a rel; Pation ship with the customer node. The relationship identifies name type ("Professional", "online avatar" etc..)

The following cypher query demonstrates how to create a name node and attach it to the customer node.

The customer would like to "Be Called" as "Louie C.K"

MATCH (p:PARTY)
WHERE p.pguid="P1376"
CREATE UNIQUE
p -[:HAS_NAME { nameType : "BE_CALLED", current: 'Y', source : 'STORES', sinceDate : '01-02-2013'}]-> (nm:PARTY_NAME {firstname:"C.K", lastname:"Louie"})

RETURN p,nm;

I have opted to store "source lineage" informations in the relationship rather than name node itself. By "Source Lineage" information i mean property like which source provided this information, which date was this available. Also the "current" property is an important one to node, it identifies if the relationship is active and is the most recent

Node 5 => Party / Customer node
Node 6 => Name node

On a different day; the customer came back to the store and decided to buy something. Now he decides to "Be Called" as "Charlie Sheen". In this case the following single cypher statment replaces the active "Be Called" name from "Louie C.K" to "Charlie Sheen"

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p:PARTY) -[hn:HAS_NAME]-> (nm:PARTY_NAME)
WHERE pid.systemid="STO001000" AND pid.source='STORES' AND r.current='Y'AND hn.nameType = 'BE_CALLED' AND hn.current = 'Y' AND hn.source = 'STORES'
SET hn.current = 'N'
WITH p,nm
CREATE p -[:HAS_NAME { nameType : "BE_CALLED", current: 'Y', source : 'STORES', sinceDate : '02-02-2013'}]-> (nm2:PARTY_NAME {firstname:"Charlie", lastname:"Sheen"})
,nm2 -[:PREVIOUS]-> nm

RETURN p,nm,nm2;

Node 7 => name node "Charlie Sheen"
Node 6 => name node "Louie C.K"
Node 5 => Party node

The cypher statement sets the "Louie CK" relationship current status to "N" and sets the relationship current status to "Y" for "Charlie Sheen". Also in order for faster travel it has created a "Previous" relationship between "Charlie Sheen" and "Louie CK".

Scenario : Maintaining preferences on sub nodes (email)

Customers typically has a lot of preferences on various interaction with a typical retail stores. For ex; when i walk into best buy I have preference to shop only on TV, laptops/tablets, speakers, games). I dont want to go to best buy to look for vacuum cleaners, refrigerators and stuffs. Another example I would like to receive deals, coupons on electronic items (via email) and dont share email information with third party for marketting.

Such kind of information are what are called as Preferences. Preferences are related to various entities (email, phone, address etc). A customer maintains some kind of relationship with an entity (ex : email). For example this customer ("STO001000") has an email address ("Jesse.Pinkman@breakingbad.tv") that he uses for personnel ("Home").

MATCH (p:PARTY)
WHERE p.pguid="P1376"
CREATE UNIQUE
p -[:HAS_EMAIL { usageType : "HOME", current: 'Y', source : 'STORES', sinceDate : '02-02-2013'}]-> (em:EMAIL {emailAddress:"Jessie.Pinkman@breakingbad.tv"})

RETURN p,em;

Preference related informations can now be stored as part of the relationship "HAS_EMAIL" that exists between the customer node and email node.

Node 8 => email node

Node 5 => customer node

In the above example the customer does not like to subscribeToDeals (r.subscribeForDeals='N') , have the retail store send his email to its subsidary (r.sendEmailToSubsidary='N'), but would like to receive coupon (r.receiveCoupons='Y')

Scenario : Additional entities and additional party

Before going onto the next scenario "merged parties". I wanted to create more sub entities.

Lets say the customer bought some valuable items as part of buying that item he has decided to provide realistic information like "phone", "address" etc. THe following cypher statement creates these entities and also create a relationship between the customer.

MATCH (src:SOURCE)
WHERE src.name='ONLINE'
CREATE UNIQUE
src -[:HAS_PARTY {active: 'Y',last_update_date:'02-01-2013'}]-> (pid:SOURCE_PARTY_ID{source:'ONLINE',systemid:'ON00891', since_date:'01-01-2013', active: 'Y'}),
pid -[:IDENTIFIES_PARTY {current:'Y'}]-> (p2:PARTY {name:"P2498", pguid:"P2498"})

RETURN p2;

The following cypher statements creates

Online customer with id "ON00891" and guid "P2498"
has a "avatar" name of "Jessie Pinkmen"
Uses email address "Jessie.Pinkman@breakingbad.tv" for his home/personnel use.
Have a phone number "7329876534"
and has provided a billing & shipping address; when he bought some item from online.

#Create another party p2 that is fed from online having an id "ON00891" and the party has a work email "Jessie.Pinkman@breakingbad.tv"
MATCH (src:SOURCE)
WHERE src.name='ONLINE'
CREATE UNIQUE
src -[:HAS_PARTY {active: 'Y',last_update_date:'02-01-2013'}]-> (pid:SOURCE_PARTY_ID{source:'ONLINE',systemid:'ON00891', since_date:'01-01-2013', active: 'Y'}),
pid -[:IDENTIFIES_PARTY {current:'Y'}]-> (p2:PARTY {name:"P2498", pguid:"P2498"})
RETURN p2;

#Add name node of usage type "AVATAR" to party "P2498"
MATCH (p:PARTY)
WHERE p.pguid="P2498"
CREATE UNIQUE
p -[:HAS_NAME { nameType : "AVATAR", current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> (nm:PARTY_NAME {firstname:"Jessie", lastname:"Pinkmen"})
RETURN p,nm;

#Create link for email, phone and address
MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY),(em:EMAIL)
WHERE pid.systemid="ON00891" AND r.current='Y'
AND em.emailAddress="Jessie.Pinkman@breakingbad.tv"
CREATE UNIQUE
p2 -[:HAS_EMAIL { usageType : "HOME", current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> em
RETURN p2,em;

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY),(ph:PHONE)
WHERE pid.systemid="ON00891" AND r.current='Y'
AND ph.number="7329876534"
CREATE UNIQUE
p2 -[:HAS_PHONE { usageType : "PERSONNEL", phoneType : "SmartPhone", current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> ph
RETURN p2,ph;

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY),(addr:ADDRESS)
WHERE pid.systemid="ON00891" AND r.current='Y'
AND addr.addrLine1="1320 Hollow Way Dr" AND addr.city="Matawan" AND addr.state="NJ" AND addr.Country="US"
CREATE UNIQUE
p2 -[:HAS_ADDRESS { usageType : "BILLING", current: 'Y', current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> addr
RETURN p2,addr;

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY)
WHERE pid.systemid="ON00891" AND r.current='Y'
CREATE UNIQUE
p2 -[:HAS_ADDRESS { usageType : "SHIPPING", current: 'Y', current: 'Y', source : 'ONLINE', sinceDate : '01-02-2013'}]-> (addr:ADDRESS {addrLine1:"1564 SleepyHollow Ave", city:"Matawan", state:"NJ", Country:"US"})

RETURN p2,addr;

The following query is a simple way to retrieve all the nodes, relations attached to a customer node. In othewords; by executing this query we get a complete picture of the customer information.

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY)-[*]->(n)
WHERE pid.systemid="ON00891" AND r.current='Y'

RETURN p2,n;

Scenario : Merging party nodes

In our small database; we now have a customer who has done some shopping in a local retail store and also bought some items via online store. Here is the picture of the both the customers

MATCH (pid:SOURCE_PARTY_ID) -[r:IDENTIFIES_PARTY]-> (p2:PARTY)-[*]->(n)
WHERE r.current='Y'

RETURN p2,n;

from the picture above we find that both the customer share same entities (email (node 8), phone (node 11) & address (node 12).

In a MDM solution; the matching engine component/sub-system helps in finding duplicate customers and merges the information between the customer into one single entity. Lets suppose that the above customer node are the same and we have elected to merge them. Now, we look at how this can be done. As part of merging; the old nodes should be maintained for history and origin determination.

Neo4j does not offer capabilities to merge nodes together, but this could be achieved as follows. First we start of by creating a new customer node that represents merging of the customer node from stores and also online. This could be achieved via this single cypher statement

MATCH (pid1:SOURCE_PARTY_ID) -[r1:IDENTIFIES_PARTY]-> (p1:PARTY)
WHERE pid1.systemid="STO001000" AND r1.current='Y'
WITH p1
MATCH p1,(pid2:SOURCE_PARTY_ID) -[r2:IDENTIFIES_PARTY]-> (p2:PARTY)
WHERE pid2.systemid="ON00891" AND r2.current='Y'
CREATE p1 -[:MERGED_TO { partid1:"P1376", partyid2:"P2498", merge_date:'03-01-2013'}]-> (p3:PARTY {pguid:"P5836"})
,p2 -[:MERGED_TO { partid1:"P1376", partyid2:"P2498", merge_date:'03-01-2013'}]-> (p3)

RETURN p1, p2,p3;

What the above does is that it creates a new customer / party node (guid : P5836) and creates a "Merged To" relation between Stores Customer (STO0010000) and Online customer (ON00891).

Node 15=> new customer (merged) node
Node 5 => stores customer node
Node 10 => online customer node.

At this time, Neo4j does not have a way copy relations from one node to another node. The "SET" operator can replace / copy all properties between nodes but it still not copy the relationships. The only way we could this is manually create the relationship or via programatic code (ex: via java driver classes).

In this poc; i have been keeping only cypher statement and hence did not go the route of creating these relations.

Conclusion
Though I havent covered all the scenarios, i believe the demonstration would have given some overview of how the data could be modeled in Neo4j.

I find that querying and getting the results is very easy. dont have to worry about joins and joins on multiple tables.

Speed is faster as now i can retrieve all the party information in literally one simple call.

I wanted to test the invocation via Spring Data, but i believe it is still in development for this beta version and hence did not take time to test it out.

I definetly feel that this is the modern data store and probably will be seeing this in the future versions of MDM based projects.

For now, a polyglot based implementation can be developed with the COTS MDM and Neo4j. Neo4j would store the relationships. This can avoid multiple joins and faster search and retrieval of party information.

Experimental learning

Wednesday, November 20, 2013

Graph database (Neo4j) as a datastore for MDM

Background

POC

1 comment: