The Mad Grapher: graph theory


Monday, 27 February 2012

Fun with Gremlin (not related to the movie or the car)

So you have all of this graph-tacular data in your graph database (for this post, I'm using neo4j). It looks slick with its vertices and edges. People stop by your desk to say, "Is that the new connect-the-dots app you're working on?"

After staring at them for a couple moments and repressing the urge to sell them the "app" for five bucks, you start thinking about how you're going to access and use this wonderful data.

"If only there was a way to query the data..." you wonder.

While SQL is the querying language of choice for relational databases, there is no real "standard" as far as NoSQL databases go (that's a subject for another time).

In the world of graph-driven databases, there are options:

Gremlin: a Groovy-based querying language that can handle any type of query. This language is perhaps geared more towards those with more of a math- or graph-based background as the syntax is nothing like SQL.
SPARQL: a popular query language for RDF graphs. This language is likely more easily picked up by those with an SQL background as the syntax is more SQL-like than Gremlin.
Cypher: a neo4j-specific querying language. This language currently only allows read-only queries of graphs (i.e. no inserts, updates or deletions). Like SPARQL, Cypher also takes its cues from SQL.

Neo4j folks: At the time of this writing, neo4j comes pre-loaded with plugins for both Gremlin and Cypher. (As I understand things, there is currently a ticket open in the neo4j community to develop a SPARQL plugin, but it has not yet been completed; there are likely complications stemming from the fact that neo4j is not an RDF database at its core.)

In this article, I'm going to cover some basics in using Gremlin. By using some concrete examples, I hope to demonstrate a bit of the power behind using a graph-based database!

For the purposes of this post, I'll be making use of a small graph I created that contains some people, who they know, and what they've purchased. I expect this graph to grow and change as time goes on, but, that's where it stands for now (this is the beauty of schema-less data).

So far, Gremlin is the only querying language I've used with graph databases, hence this article making use of Gremlin. The good thing about this Gremlin is that it won't require a new muffler and you don't have to worry about feeding it after midnight.

Getting started with Gremlin and Neo4j is easy enough. It's a plugin that comes with Neo4j, so all you need to do to begin is to open up your Neo4j web admin instance, click the "Console" menu option at the top, and select the "Gremlin" option from the top-right of the console that appears.

At this point, you're faced with the currently-available variables and a gremlin overlooking them, like so:

We see that the variable g contains the current graph. If we issue the query "g.V" (without the quotes, as always), we get a list of all the vertices (nodes) in the graph; however, this information is not incredibly useful as you're only given each node's ID.

Let's say some (but not all) of our nodes have been given the property "Name". If we try using "g.V.Name", we'll again see a listing of the nodes; however, the value of the "Name" property for each node (if available) will appear (if "Name" isn't a property of a specific node, "null" will appear; also, note that Gremlin is case-sensitive).
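To make the difference between "g.V" and "g.V.Name" a bit more concrete, here is a minimal sketch in plain Python (not Gremlin, and certainly not how neo4j stores anything internally). The vertex IDs, the names, and the `vertices` dictionary are all made up for illustration.

```python
# Hypothetical vertex store: id -> property map (all values are made up).
vertices = {
    1: {"Name": "Alice"},
    2: {"Name": "Bob"},
    3: {"ProductType": "Book"},  # this node has no "Name" property
}

# Rough analogue of g.V: every vertex, identified only by its ID.
all_ids = sorted(vertices)

# Rough analogue of g.V.Name: the property value per vertex, None if absent.
names = [vertices[v].get("Name") for v in all_ids]

print(all_ids)  # [1, 2, 3]
print(names)    # ['Alice', 'Bob', None]
```

The `None` in the second list plays the role of the "null" that the console shows for nodes without the property.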

We can also similarly view a list of the edges (relationships) by issuing "g.E" to the console; this time, however, in addition to the edge IDs, we also see the type of relationship and the adjoining vertices (nodes), e.g. 1-KNOWS->6. Note that you can see that these edges are directed! In this case, we see that the edge goes out from node 1 and goes in to node 6. Useful stuff.

We can also (similarly) view a property of the edges (if it exists) as we did for the vertices, e.g. "g.E.Quantity".

Identifying individual vertices and edges is simply a matter of knowing each one's ID number. Obtaining a reference to a node or an edge (which, yes, can be assigned to variables for easier reference) involves a call like "g.v(6)" for a node or "g.e(3)" for an edge (note the casing).

You can examine the value of an individual node's/relationship's specific property by querying it much like we did above, e.g. "g.v(6).Name".

Want to know all of the edges coming out of a node? "g.v(6).outE" will do the trick. Similarly, if you want to know all of the edges coming in to a node, we can use "g.v(6).inE".

We can also go one step further and use the "inV" and "outV" steps to identify the nodes on the ends of an edge. "inV" will correspond to the node at the head of an edge (also known as the "incoming vertex"), whereas "outV" will correspond to the other side of an edge (an "outgoing vertex").

You can also use "bothV" and "bothE" to get both incoming and outgoing vertices and edges (respectively).

So if you want to travel from one node to another, you might do something like: g.v(1).outE.inV. You can shorten this by using: g.v(1).out. There exist similar constructs for "in" and "both". (Note that "in" is a short-cut for "inE.outV".)
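If the in/out naming makes your head spin (it did mine), the following plain-Python sketch may help. It is only a model of what the steps mean, not Gremlin itself; the edge list and the helper names (`out_e`, `in_e`, `out`, `in_`, `both`) are my own inventions for illustration.

```python
# Hypothetical directed edges as (tail, label, head); IDs and labels are made up.
edges = [
    (1, "KNOWS", 6),
    (1, "LIKES", 4),
    (6, "KNOWS", 1),
    (2, "KNOWS", 1),
]

def out_e(v):
    """Edges leaving v (roughly what outE returns)."""
    return [e for e in edges if e[0] == v]

def in_e(v):
    """Edges arriving at v (roughly what inE returns)."""
    return [e for e in edges if e[2] == v]

def out(v):
    """Neighbours reached by leaving v: outE followed by inV (the head)."""
    return [head for (_tail, _label, head) in out_e(v)]

def in_(v):
    """Neighbours that point at v: inE followed by outV (the tail)."""
    return [tail for (tail, _label, _head) in in_e(v)]

def both(v):
    """Neighbours in either direction."""
    return out(v) + in_(v)

print(out(1))   # [6, 4]
print(in_(1))   # [6, 2]
print(both(1))  # [6, 4, 6, 2]
```

Note that node 6 shows up twice in `both(1)` because it is connected to node 1 by two separate edges, one in each direction.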

Have more than one type of relationship attached to a node? No problem! You can follow a specific type via something like this: g.v(1).out('LIKES') (this will take you to the node on the other end of each 'LIKES' relationship going out of node g.v(1)).
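In the same spirit, here is a small plain-Python model of following only one relationship type. The `edges` list and the `out` helper with its optional `label` argument are hypothetical stand-ins, not Gremlin's implementation.

```python
# Hypothetical directed edges as (tail, label, head).
edges = [
    (1, "KNOWS", 6),
    (1, "LIKES", 4),
    (1, "LIKES", 5),
]

def out(v, label=None):
    """Heads of edges leaving v, optionally restricted to one relationship type."""
    return [head for (tail, lbl, head) in edges
            if tail == v and (label is None or lbl == label)]

print(out(1))           # [6, 4, 5]
print(out(1, "LIKES"))  # [4, 5]
```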

(Gremlin's github page has a good basic tutorial about all this.)

I could go on at great length about the features of Gremlin, but I think this is a great starting point. I'm going to include some concrete examples below. If I use any constructs or syntax that don't make sense, I very much encourage you to visit the Gremlin wiki page to look up the answers; this is a good little exercise, especially for the "groupCount" and "cap" constructs.

The examples below assume a graph that has nodes describing products and people, and relationships showing purchases and who knows whom.

How many times has each product been purchased?

g.V.inE('PURCHASED').inV.ProductType.groupCount.cap
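For anyone who wants to see the counting spelled out, here is a rough plain-Python equivalent of that query, under the assumption that each PURCHASED edge points from a buyer to a product. The IDs and product types are made up.

```python
from collections import Counter

# Made-up data: product vertices and PURCHASED edges (buyer -> product).
product_type = {10: "Book", 11: "Lamp", 12: "Book"}
purchased = [(1, 10), (2, 10), (2, 11), (3, 12)]

# One count per PURCHASED edge, keyed by the ProductType at the head of the edge.
counts = Counter(product_type[product] for (_buyer, product) in purchased)

print(counts)  # Counter({'Book': 3, 'Lamp': 1})
```

One thing this makes visible: the grouping is by the ProductType value, so two different product nodes that share a type (10 and 12 here) end up in the same bucket.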

Which products have been purchased more than once?

g.V.filter{it.inE('PURCHASED').count() > 1}.ProductType
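The filter above can be modelled in plain Python as "keep the products whose incoming PURCHASED edge count is greater than one". As before, this is only a sketch with invented data, and `purchase_count` is a hypothetical helper.

```python
# Made-up data: product vertices and PURCHASED edges (buyer -> product).
product_type = {10: "Book", 11: "Lamp", 12: "Pen"}
purchased = [(1, 10), (2, 10), (2, 11), (3, 12), (3, 12), (1, 12)]

def purchase_count(product):
    """Number of PURCHASED edges arriving at the given product node."""
    return sum(1 for (_buyer, p) in purchased if p == product)

popular = [product_type[p] for p in sorted(product_type) if purchase_count(p) > 1]

print(popular)  # ['Book', 'Pen']
```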

Who is known by more than one person in this graph (we define 'knowing' as sharing an edge/relationship--in or out--with another person)?

g.V.filter{it.bothE('KNOWS').count() > 1}.Name

Who knows the most people, and how many people do they know?

g.V.outE('KNOWS').outV.Name.groupCount.cap

How well-known is each person?

g.V.both('KNOWS').Name.groupCount.cap
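The last few questions all come down to counting KNOWS edges per person, and the answer depends on whether you count edges in both directions or outgoing ones only. Here is a plain-Python sketch of the two ways of counting, with made-up people and edges; it models the idea rather than reproducing Gremlin's behaviour exactly.

```python
from collections import Counter

# Made-up people and directed KNOWS edges: (who, knows whom).
name = {1: "Ann", 2: "Bo", 3: "Cy"}
knows = [(1, 2), (1, 3), (2, 3)]

# Edges touching each person, in or out (the "sharing an edge" definition).
touching = Counter()
for a, b in knows:
    touching[name[a]] += 1
    touching[name[b]] += 1

# Outgoing edges only: how many people each person knows.
knows_count = Counter(name[a] for (a, _b) in knows)

print(dict(touching))              # {'Ann': 2, 'Bo': 2, 'Cy': 2}
print(dict(knows_count))           # {'Ann': 2, 'Bo': 1}
print(knows_count.most_common(1))  # [('Ann', 2)]
```

In this toy graph everyone touches two KNOWS edges, yet only Ann actually knows two people, so it pays to decide which question you are asking before you count.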

(By the way, if anyone notices anything wrong with anything above, please let me know as I'm always looking to evolve and develop my knowledge of, well, everything.)

So you begin to see the power of what we can extract out of a graph database! Personally, I'm tempted to find out "who is your daddy and what does he do?" Such a question would be relatively straight-forward to figure out!

Ok, I think that's enough for now. Hopefully this is a decent (but brief) introduction to the world of Gremlin and querying graph databases. I know writing this has forced me to examine in more depth exactly what these queries actually do.

I'll see you next time!

Tuesday, 14 February 2012

Basic Graphs

The purpose of this post is to give a common footing for those reading to understand what I mean when I talk about a "graph". The field of graph theory is a very deep and well-explored one. The real trick here is trying to give an introduction to graphs and graph theory without getting lost in too much detail. I will endeavour to keep this post to the utter, simplified basics of graphs.

(My apologies to anyone who already knows all about graph theory; this is meant more for those who might not have been exposed to graphs before, as well as to establish some common terms and nomenclature I'll be using in this blog. If some of the definitions seem too simplistic, please bear in mind I'm trying to just get the basics out there so anyone can understand the concepts.)

(And really, Wikipedia is a great resource for those wanting a slightly more in-depth and well-structured discussion on graphs and graph theory.)

When most people think "graphs", they think of a series of bars or lines depicting their Q4 growth projections. (Ok, most marketing and sales people I talk to do, anyway.)

We also think of graphs like the following (which is one of my personal favourites from our good friend Jay-Z):

(Source: http://mammon.typepad.com/photos/uncategorized/2007/09/13/99problems_graph.jpg)

Alas, these are not the graphs I'm referring to in this blog. The kind I'm talking about is a little bit different; though, at its core, perhaps we can take a graph to mean (at a very oversimplified level), "a graphical representation of data".

The kind of graph I'm referring to in this blog has to do with the representation of objects and their interconnections. Because I'm terrible at drawing (and lazy), I'll show a simple graph that illustrates what I'm saying and then try to describe what's going on.

This, my graph-fiends (yes, I just made that horrible pun up), is one example of a graph. What does this graph consist of? Here's what we have:

6 vertices or nodes (the circles), and,
7 edges or relationships (the lines between the circles).

At first glance, that's all there is to it. In mathematics, we commonly use the terms "vertices" and "edges" (the vertices being the circles/points, the edges being the lines between); however, it's likely a bit more intuitive to most to use the terms "nodes" and "relationships" in place of "vertices" and "edges", respectively. I'll try to stick to using "nodes" and "relationships", though I may occasionally use the other terms interchangeably.

Is a graph still a graph if there are no edges involved? Absolutely. This is one example of a "null graph". (For those interested, at the other end of the spectrum, a "complete graph" is one whose vertices are all connected/adjacent to each other, for a total of n(n-1)/2 edges, where n is the number of vertices.)
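If you'd rather check the n(n-1)/2 figure than take my word for it, here is a short Python sketch that builds the edges of a complete graph as every unordered pair of vertices and counts them. The function name `complete_graph_edges` is just something I made up for this example.

```python
from itertools import combinations

def complete_graph_edges(n):
    """All unordered pairs of n vertices: the edge set of a complete graph."""
    return list(combinations(range(n), 2))

# The pair count matches n(n-1)/2 for a handful of small graphs.
for n in range(1, 7):
    assert len(complete_graph_edges(n)) == n * (n - 1) // 2

print(len(complete_graph_edges(6)))  # 15
```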

Edges/relationships can be "undirected" (like the example above), or "directed" (like the example below). Directed edges have a source and a destination (this is useful for showing relationships between two objects). Undirected edges can also be used to show a bi-directional relationship (i.e. one that exists in both directions); for example, Tom likes Dana, and Dana likes Tom.
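One way to see the difference is to store the same relationship both ways. The sketch below is plain Python using a dictionary of neighbour sets; the `adjacency` helper is hypothetical, and this is only one of several reasonable ways to represent a graph.

```python
def adjacency(edges, directed):
    """Build node -> set of neighbours from (source, destination) pairs."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set())
        if not directed:
            # An undirected edge can be followed from either end.
            adj[dst].add(src)
    return adj

likes = [("Tom", "Dana")]

print(adjacency(likes, directed=True))   # {'Tom': {'Dana'}, 'Dana': set()}
print(adjacency(likes, directed=False))  # {'Tom': {'Dana'}, 'Dana': {'Tom'}}
```

With the directed version we only know that Tom likes Dana; the undirected version records the feeling as mutual.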

Still with me?

So, we know that a graph consists of a series of vertices or nodes that can be connected via edges or relationships. These edges can be undirected or directed.

At its core, that is what a graph is.

By now, if you haven't already, you're seeing how this might be useful to represent highly connected data. Social networks are a great example (think of how you could represent your "friendships" on Facebook or MySpace with a graph; this is commonly called a "social graph"). Try to think of other information you could represent in graph fashion.

What about a family tree? For sure. As a matter of fact, a tree is itself a type of graph: one that is connected and acyclic (meaning that there are no loops, i.e. if you follow the edges without doubling back, you never arrive at a node you've already visited). In a family tree, the edges are also directed (say, from parent to child).
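One common textbook check, if we ignore edge direction, is that a graph with n nodes is a tree when it is connected and has exactly n-1 edges. Here is a small Python sketch of that check; the `is_tree` function and the family members are made up for illustration.

```python
def is_tree(nodes, edges):
    """True if the undirected graph is connected and has no cycles."""
    nodes = list(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Depth-first walk from the first node; a tree must reach every node.
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)

family = ["Gran", "Mum", "Uncle", "Me"]
parent_of = [("Gran", "Mum"), ("Gran", "Uncle"), ("Mum", "Me")]

print(is_tree(family, parent_of))                     # True
print(is_tree(family, parent_of + [("Uncle", "Me")])) # False
```

The second call adds one extra relationship, which closes a loop and disqualifies the graph from being a tree.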

Weight is another concept we come across in graph theory and graph databases. Think of weight as being how important or strong a relationship is between two nodes. For example, a higher weight can imply a stronger relationship between two nodes representing, say, two friends.

This is an important concept as we can use it to determine a shortest path or least costly path between two nodes. (See an explanation of paths further below.)
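As a rough illustration of a least costly path, here is a short Python sketch of Dijkstra's algorithm, which is one standard way to do this (I'm not claiming it is what any particular graph database uses). It assumes each weight is a non-negative cost, where lower is better, which is the opposite sense from "strength"; the graph and the `cheapest_path_cost` name are made up.

```python
import heapq

def cheapest_path_cost(graph, start, goal):
    """Dijkstra's algorithm over non-negative edge costs; None if unreachable."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbour, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return None

# Hypothetical weighted, directed graph: node -> {neighbour: cost}.
graph = {
    "A": {"B": 4, "C": 1},
    "C": {"B": 2},
    "B": {"D": 5},
}

print(cheapest_path_cost(graph, "A", "D"))  # 8
```

The direct hop from A to B costs 4, but going via C costs only 3, so the cheapest trip to D is 1 + 2 + 5 = 8.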

One slightly more complicated idea is that of degree, which represents the number of edges attached to a node. So, a node with a high degree will be one that has a lot of relationships attached to it. This can be very useful for determining just how popular, for example, something or someone is. (Cue jokes for this blog.)
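Counting degree takes only a few lines. This Python sketch treats the edges as undirected and uses an invented edge list.

```python
from collections import Counter

# Made-up undirected edges.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

# Each edge adds one to the degree of both of its endpoints.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(degree.most_common(1))  # [('A', 3)]
```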

One last concept to get out there before wrapping this article up is the concept of a walk or traversal. A traversal starts at some node X and follows edges to some other node Y; its result is the path taken, i.e. the nodes and edges visited during the trip from X to Y. This concept becomes important as we start to look into graph-based databases, as traversals are part of what we can use to extract useful information from a graph (e.g. "Who are all of my cousins?") and even to predict and recommend a product for someone based on their browsing history.
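To give a feel for what a traversal looks like in code, here is a minimal Python sketch using breadth-first search, which is just one of several traversal strategies. The graph, the node names, and the `find_path` helper are assumptions of mine for the sake of the example.

```python
from collections import deque

def find_path(adj, start, goal):
    """Breadth-first search; returns the nodes on a fewest-hops path, or None."""
    previous = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the breadcrumbs back to the start, then reverse them.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical directed graph: node -> list of nodes it points to.
adj = {"X": ["A", "B"], "A": ["Y"], "B": ["A"], "Y": []}

print(find_path(adj, "X", "Y"))  # ['X', 'A', 'Y']
```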

Ok, I think that's enough math-like stuff for one post. The definitions given in this post don't even scratch the surface of what's out there in the world of graph theory. While I'm no serious graph theorist academically, I'm sure there are some standard texts out there. That said, as mentioned earlier, Wikipedia is really a great resource for exploring the world of graph theory a bit more.

I'll introduce other graph-related concepts as they become needed (e.g. searching or path-finding); however, if there are requests for discussing specific graphing concepts, I'd be happy to address them (provided I can speak intelligently to them, of course).

Now go weird your friends out with jokes that their family tree isn't really a tree as it has loops!

So, You Want to Be a Grapher?

I've managed to resist the urge to set up a blog, until now. A good friend and colleague of mine convinced me to do this based on a discussion we had over a beer the other night (and, let's face it: that's always the best place to have such ideas). So, Martin, thanks for that. I think.

So, the point of this blog, you wonder anxiously? What a great question for a segue into an introduction!

I've been working in the IT field as a software engineer for some time and am currently the VP, Technology for a downtown-Toronto based software development firm. As such, it behooves me (what a great term) to at least try to keep up-to-date with emerging technologies, especially as they mature.

To that end, as of late, I've become fascinated with the whole NoSQL paradigm. Having spent most of my professional career dealing with RDBMSes, I was curious as to how the whole Big Data notion fit into things. Browsing through the myriad niches of NoSQL--from document-based to key-value based--and learning more about the whole movement along the way, I came across one particular type of NoSQL database that really got me glued to the ceiling.

Graph-based databases.

Coming from a background in not just computer science and software engineering, but mathematics as well (I attended the University of Waterloo up here in Ontario, Canada), the graph paradigm spoke volumes to me.

Sure, my math as it pertains to graphs may be a bit rusty, but it's something I clearly remember being rather interested in (should have taken more graph theory courses...).

Even more exciting is the fact that such databases existed. For those of us who understand (or at least know about) graphs, I think it's safe to say that we can all appreciate the representation of social networks, semantics, and other relationship-driven data, as graphs.

What hit me like a tonne of bricks (Lego or otherwise) was that graph-based databases have actually existed for some time. How long, exactly, I'm not 100% sure yet (as an example, one such database is neo4j which has been around since 2007).

The fact that highly-connected data could be so easily and directly represented in technology (along with the above) is what drove me to start digging deeper. Such data exists in abundance around the web (and elsewhere!).

So began my adventure into the realm of graph-based databases.

"What do you hope to accomplish with this blog? What are your goals?" Another astute question; one that I have anticipated to some degree:

As much as I hope to inform and educate, this blog is as much to serve as a record of my journey into graph-based databases. I'm hoping that, as I continue to learn, others may glean some knowledge/insight from my posts (however little or much that may be).
As stated above (I'll repeat it for the sake of being explicit), I hope to educate and inform people as the world of graph-based databases expands and matures.
To explore graph-based databases and their related concepts. This may include information on other aspects of NoSQL, or even deeper dives into graph theory.
To give a base from which to derive a basic understanding of graphs and their databases.

I think that's about sufficient for now. Should I need to revisit these goals, I will do so when the time comes.

If anyone actually reads these posts, I greatly welcome feedback and comments. This blog may not be everyone's cup of tea, but, I'll take that chance. Let's try to keep the comments at least somewhat constructive (I'm sure some out there will pick apart things that I may get wrong, but I fully expect and welcome such criticism and corrections).

For the time being, I'll link to any sources from the web that I use. If I miss any or if people feel that I'm not citing enough, please let me know. It is not my intention to plagiarize anyone's work as I know developing it comes with no small effort; rather, it is my intention to aggregate and disseminate knowledge wherever possible.

Ok, I think this post is long-winded enough. The frequency with which I post remains to be seen. While I'm sure most won't exactly be waiting with bated breath, maybe they will!

With that, I look forward to posting more soon! Back to the grindstone!

(And Happy Valentine's Day to all of you out there!)