The Mad Grapher: concepts

I know it's been a while since my last post, but I assure you--one and all--that I have not left you kind folks for good. I can also assure you that my literary absence has everything to do with being loaded down with work and nothing to do with gallivanting around the world whilst trying strange and wonderful new beers.

(Boy, do I ever wish that were the case.)

Strange AND wonderful!
[Courtesy: FOX]

No, I've managed to find some time in which to write a bit about something I've been looking into concerning SDN: Paths with multiple entity types.

So, without further ado, let's begin.

Yes, quite.
[Courtesy: samuelrunge.com and Monty Python]

Disclaimer

While the topics I'm about to go into could very well have solutions that I have yet to uncover, I'm presenting my findings and questions in the hope that not only will someone find the discussion interesting or useful, but that it might also lead me to better solutions.

So, please do read on and try to keep the caveat above in mind.

Retrieving Heterogeneous Paths

If you'll recall from several posts ago, I had been attempting to write a web application based around the concept of a simple recommendation engine and a software retailer (a la Babbage's from the last century). The implementation was being done using a Spring-based Java stack with Neo4j as the data store.

I had gotten to a point where I was able to load data into Neo4j via a method not unlike the one from Cineasts (which can be found as part of the Spring Data Neo4j Reference). A quick recap of the domain model follows below.

The domain model in question.

One important task of any recommendation engine is the ability to suggest entities that are relevant to a given starting point via some sequence of relationships. Implementing such an engine, while quite doable, is something that I will eventually get to.

Another, simpler, task is to be able to see how two entities are related, i.e. "how do I get from point A to point B?". As an example, one such question might be, "How is game A related to game B, even though the difference in publication dates is large?" And yes, this type of question is extremely similar to the "Six Degrees of Separation" question.

I bet he's a gamer.
[Courtesy: Wikipedia]

Some Basic Code

You might also recall that I had implemented a GameRepository class with the following signature:

 
public interface GameRepository extends GraphRepository<Game>, RelationshipOperationsRepository<Game>

With the above repository extending the GraphRepository and RelationshipOperationsRepository interfaces, we are provided with a host of cool (and handy) methods out of the box (tip: make sure you're comfortable with the "convention over configuration" paradigm, as there is a bit of magic that goes on with those OOTB methods).

As you'd expect, we can also add additional methods to the interface. One example of a method you might want to add could be a custom Cypher query that returns all Game nodes with a particular property (the implementation of this is outside the scope of this post, but, it's actually pretty simple; if people want to see it, just shoot me a note!).

However, today we're looking to address the "Six Degrees of Separation" question (minus the limit on the degrees of separation), i.e. "how are node A and node B related?"

So let's give this a (very simple) shot:

 
@Query("START n=node(1), x=node(97) MATCH p = shortestPath( n-[*]-x ) RETURN p")
Iterable<EntityPath<Game, Game> > getPath();

Given that the nodes with IDs 1 and 97 are "Game" nodes, the Cypher query above is essentially determining how the two nodes are related.

(For the sake of this post, I'm ignoring the fact that there could be multiple "shortest paths" between the two nodes as it has little bearing on the goal of this post.)

Quickly going over the return type: SDN allows us to return an EntityPath given a starting node type and an ending node type which, in this case, are both the Game type. An EntityPath is capable of storing the nodes and relationships that make up a Cypher path. The Iterable portion of the return type is necessary unless you want to use EndResult instead of Iterable.

We can then access the individual path(s) via the Iterable return type.

(NOTE: There is currently a bug with SDN that throws an exception when calling a single path's .nodes() or .nodeEntities(). This bug has been around since SDN 2.1.RC4.)

Traversing the Returned Path

There's a reason my explanation of the code used stops where it does. Those of you who have a keen eye and/or are familiar with OOP/OOD will identify a potentially big stumbling block: How do you iterate through a path of nodes and/or relationships with potentially wholly disparate, unrelated types? Given that this is SDN and is geared towards integrating easily with Java and POJOs, the issue becomes apparent.

How do we solve this?

Solution 1

Make sure the nodes you are after all either implement a common interface or are derived from a common class.

Seem too good to be true? You're right, because it is. While there may be some scenarios in which this solution might work, it is often the case in a graph database that the nodes/relationships are entirely unrelated concepts/types, e.g. Game and Customer, or Game and Developer. This separation implies that there are likely methods and attributes specific to a given type that would have no business being in a shared superclass or interface.
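To make the trade-off concrete, here is a minimal sketch of what Solution 1 might look like. Everything in it is hypothetical: PathNode, label(), PathPrinter and the stripped-down Game and Developer classes are stand-ins I've made up for illustration, not SDN types or the real domain classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shared contract; none of these names come from SDN.
interface PathNode {
    String label();
}

final class Game implements PathNode {
    private final String title;
    Game(String title) { this.title = title; }
    public String label() { return "Game: " + title; }
    // Game-specific behaviour like this is invisible through PathNode.
    String title() { return title; }
}

final class Developer implements PathNode {
    private final String name;
    Developer(String name) { this.name = name; }
    public String label() { return "Developer: " + name; }
}

final class PathPrinter {
    // Handles any mix of node types, but only via the shared method.
    static List<String> describe(List<? extends PathNode> path) {
        List<String> labels = new ArrayList<String>();
        for (PathNode node : path) {
            labels.add(node.label());
        }
        return labels;
    }
}
```

Walking the path is trivial, but anything beyond label() (such as Game's title()) is out of reach without a cast, which is exactly the limitation described above.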

Solution 2

Employ some form of reflection.

There it is: The "r" word. Reflection is generally quite costly and so immediately handicaps its appeal.

A "poor man's" reflection might be to implement a series of "if/else" blocks to check types and perform some appropriate casting. I think we can see that this could and would become very ugly and difficult to maintain.
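For the curious, a rough sketch of that "if/else" approach follows. GameEntity, CustomerEntity and ManualDispatch are hypothetical names of my own, standing in for whatever unrelated entity types a path might contain.

```java
// Hypothetical, unrelated domain types standing in for the real entities.
final class GameEntity {
    final String title;
    GameEntity(String title) { this.title = title; }
}

final class CustomerEntity {
    final String email;
    CustomerEntity(String email) { this.email = email; }
}

final class ManualDispatch {
    // Every new entity type needs another branch (and another cast) here.
    static String describe(Object entity) {
        if (entity instanceof GameEntity) {
            GameEntity game = (GameEntity) entity;
            return "Game titled " + game.title;
        } else if (entity instanceof CustomerEntity) {
            CustomerEntity customer = (CustomerEntity) entity;
            return "Customer at " + customer.email;
        } else {
            return "Unknown type " + entity.getClass().getName();
        }
    }
}
```

Two types are manageable; a dozen node types plus their relationship types is where the maintenance pain sets in.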

What about full-on automated reflection? Well, we run into a bit of a snag with that, too: In a typical assignment operation, we have a left-hand side (LHS) and a right-hand side (RHS), e.g. TheClass theClassInstance = new TheClass();. The RHS of the assignment is fairly straightforward with reflection. Since SDN persists a __type__ attribute/property into a given node/relationship, we can fetch it and use it for casting (since it's typically a fully-qualified type name). It might look something like this:

// n = the Node object in question
String theType = (String) n.getProperty("__type__", "");

// The RHS of the assignment could then be something like:
// (LHS) = template.convert(n, Class.forName(theType));

But what about the LHS? Without an "if/else" block, how can we treat the returned string theType as a first-class citizen that would declare the type for the LHS? As far as I'm aware, there is no way to do this (of course there might be ways I've just not seen, but I have a feeling they'd be just as expensive as the rest of the assignment). Java is a statically and strongly typed language, so the type of the LHS has to be known at compile time, and I'm sure most of us would expect this outcome.

So we see that this "solution" isn't really much of one.
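Here is a small sketch of how far plain Java reflection gets us, with no SDN involved. The typeName argument plays the role of the stored __type__ value; TypeRegistry, register and handle are hypothetical names of my own. Class.forName happily turns the string into a Class object, but the value itself is still just an Object until some code written for a specific type takes over.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class TypeRegistry {
    private final Map<Class<?>, Function<Object, String>> handlers =
            new HashMap<Class<?>, Function<Object, String>>();

    // The cast to T happens here, once per registered type.
    <T> void register(Class<T> type, Function<? super T, String> handler) {
        handlers.put(type, value -> handler.apply(type.cast(value)));
    }

    // typeName stands in for the "__type__" property read from a node.
    String handle(String typeName, Object value) {
        Class<?> type;
        try {
            type = Class.forName(typeName);
        } catch (ClassNotFoundException e) {
            return "unknown type " + typeName;
        }
        Function<Object, String> handler = handlers.get(type);
        if (handler == null) {
            return "no handler for " + typeName;
        }
        // Throws ClassCastException if value is not actually of that type.
        return handler.apply(value);
    }
}
```

A lookup table like this swaps the "if/else" chain for a map, but somebody still has to write one handler per type ahead of time, so the underlying problem hasn't gone anywhere.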

Solution 3

Somehow modify the query to work with a @MapResult-annotated interface to deal with the results (note: At least as of SDN 2.3.1, @MapResult has been deprecated in favour of @QueryResult. I haven't done too much with @QueryResult so your mileage may vary). This obviously requires more knowledge ahead of time of the types of paths, nodes and relationships you're planning on returning which may limit the kinds of heterogeneous queries you can execute.

So What Else Can Be Done?

I recently attended GraphConnect 2013 in New York City where I had a chance to meet up with Michael Hunger (a name which should require little or no introduction in the graph database/Spring Data continuum). We had a great conversation about the very subject of this post.

His overall insights into Spring Data and its purpose, merits and detractors were very helpful, especially from a conceptual standpoint.

The number one point to take away--and perhaps it's quite obvious but it's worth reiterating--is this: Spring Data is not a magic bullet. Given the differences in concepts here (i.e. a strongly-typed, object-oriented language and a graph-based, schema-free data store), there are bound to be limitations.

Spring Data's strong point is ease of integration. A typical use case for SDN is likely to be one where relatively few nodes/relationships need to be returned. SDN is not necessarily meant to "explore" graphs.

In order to truly resolve such differences, it would seem to me to make more sense to either layer SDN on top of another layer of abstraction, or even to go direct to the Neo4j API.

Perhaps an even better approach would be to use a dynamically typed JVM language such as Groovy or JRuby (possibly with a Domain-Specific Language (DSL) built on top); something that is much more loosely-typed, flexible, and still able to be integrated into the Java stack.

(Shameless plug: Check out Pacer, a powerful, JRuby-based graph traversal engine.)

Summary/Conclusion

In this post, we have seen that exploring subgraphs and paths with SDN is not as straightforward as we'd perhaps like; however, it is clear that SDN was not built to accomplish such features (at least not yet).

Spring Data for Neo4j's strong suit is ease of integration. As should be evident from this post and others, it is easy and straightforward to get SDN into existing/legacy Java applications, and to quickly stand-up Java-based applications that rely more on "end results" than "exploration" per se.

Ok, folks; as always, I definitely welcome comments, feedback, and questions. If you can think of a better way to approach this kind of problem space or even if I have something wrong here, please do let me know and I'll be sure to make good use of such feedback.

Thanks for reading, and we'll see you on the next post!

The purpose of this post is to give a common footing for those reading to understand what I mean when I talk about a "graph". The field of graph theory is a very deep and well-explored one. The real trick here is trying to give an introduction to graphs and graph theory without getting lost in too much detail. I will endeavour to keep this post to the utter, simplified basics of graphs.

(My apologies to anyone who already knows all about graph theory; this is meant more for those who might not have been exposed to graphs before, as well as to establish some common terms and nomenclature I'll be using in this blog. If some of the definitions seem too simplistic, please bear in mind I'm trying to just get the basics out there so anyone can understand the concepts.)

(And really, Wikipedia is a great resource for those wanting a slightly more in-depth and well-structured discussion on graphs and graph theory.)

When most people think "graphs", they think of a series of bars or lines depicting their Q4 growth projections. (Ok, most marketing and sales people I talk to do, anyway.)

We also think of graphs like the following (which is one of my personal favourites from our good friend Jay-Z):

(Source: http://mammon.typepad.com/photos/uncategorized/2007/09/13/99problems_graph.jpg)

Alas, these are not the graphs I'm referring to in this blog. The kind I'm talking about is a little bit different; though, at its core, perhaps we can take a graph to mean (at a very oversimplified level), "a graphical representation of data".

The kind of graph I'm referring to in this blog has to do with the representation of objects and their interconnections. Because I'm terrible at drawing (and lazy), I'll show a simple graph that illustrates what I'm saying and then try to describe what's going on.

This, my graph-fiends (yes, I just made that horrible pun up), is one example of a graph. What does this graph consist of? Here's what we have:

6 vertices or nodes (the circles), and,
7 edges or relationships (the lines between the circles).

At a first glance, that's all there is to it. In mathematics, we commonly use the terms "vertices" and "edges" (the vertices being the circles/points, the edges being the lines between); however, it's likely a bit more intuitive to most to use the terms "nodes" and "relationships" in place of "vertices" and "edges", respectively. I'll try to stick to using "nodes" and "relationships", though I may occasionally use the other terms interchangeably.

Is a graph still a graph if there are no edges involved? Absolutely. This is one example of a "null graph". (For those interested, at the other end of the spectrum, a "complete graph" is one whose vertices are all connected/adjacent to each other, for a total of n(n-1)/2 edges, where n is the number of vertices.)
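As a quick sanity check of that n(n-1)/2 formula, here is a tiny sketch (CompleteGraph and edgeCount are just names I've picked for illustration):

```java
final class CompleteGraph {
    // Number of edges in a complete graph on n vertices: n(n-1)/2.
    static long edgeCount(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Widen to long before multiplying so large n does not overflow int.
        return (long) n * (n - 1) / 2;
    }
}
```

For the 6 nodes in the example graph above, edgeCount(6) gives 15, so with only 7 edges that graph is a fair way from complete.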

Edges/relationships can be "undirected" (like the example above), or "directed" (like the example below). Directed edges have a source and a destination (this is useful for showing relationships between two objects). Undirected edges can also be used to show a bi-directional relationship (i.e. one that exists in both directions); for example, Tom likes Dana, and Dana likes Tom.

Still with me?

So, we know that a graph consists of a series of vertices or nodes that can be connected via edges or relationships. These edges can be undirected or directed.

At its core, that is what a graph is.
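If it helps to see that in code, below is a minimal sketch of a graph kept as an adjacency list, i.e. each node maps to the set of nodes its edges point at. SimpleGraph and its methods are hypothetical names of my own, and storing an undirected edge as a pair of directed ones is just one common choice.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class SimpleGraph {
    // Each node maps to the nodes reachable by one outgoing edge.
    private final Map<String, Set<String>> adjacency =
            new LinkedHashMap<String, Set<String>>();

    void addNode(String node) {
        if (!adjacency.containsKey(node)) {
            adjacency.put(node, new LinkedHashSet<String>());
        }
    }

    // Directed: only source -> destination is recorded.
    void addDirected(String source, String destination) {
        addNode(source);
        addNode(destination);
        adjacency.get(source).add(destination);
    }

    // Undirected: stored as a pair of directed edges.
    void addUndirected(String a, String b) {
        addDirected(a, b);
        addDirected(b, a);
    }

    boolean hasEdge(String source, String destination) {
        Set<String> neighbours = adjacency.get(source);
        return neighbours != null && neighbours.contains(destination);
    }

    int nodeCount() {
        return adjacency.size();
    }
}
```

With this, addUndirected("Tom", "Dana") captures "Tom likes Dana, and Dana likes Tom" in one call, while addDirected records a one-way relationship only.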

By now, if you haven't already, you're seeing how this might be useful to represent highly connected data. Social networks are a great example (think of how you could represent your "friendships" on Facebook or MySpace with a graph; this is often called a "social graph"). Try to think of other information you could represent in graph fashion.

What about a family tree? For sure. As a matter of fact, a tree is a type of graph that is connected and acyclic, and it is often drawn with directed edges (e.g. from parent to child). Acyclic means that there are no loops, i.e. if you start at a node and follow the edges without backtracking, you never arrive back at a node you have already visited.
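The "no loops" part can be checked mechanically. The sketch below uses a depth-first search over a directed graph and reports a loop as soon as it walks back onto a node that is still on the current route. CycleCheck and hasCycle are hypothetical names, and the map-of-lists representation is simply an assumption for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CycleCheck {
    // graph maps each node to the nodes its outgoing edges point at.
    static boolean hasCycle(Map<String, List<String>> graph) {
        Set<String> finished = new HashSet<String>();
        Set<String> onPath = new HashSet<String>();
        for (String start : graph.keySet()) {
            if (visit(start, graph, finished, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node, Map<String, List<String>> graph,
                                 Set<String> finished, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true;   // walked back onto the current route: a loop
        }
        if (finished.contains(node)) {
            return false;  // already fully explored, no loop through here
        }
        onPath.add(node);
        List<String> next = graph.get(node);
        if (next != null) {
            for (String neighbour : next) {
                if (visit(neighbour, graph, finished, onPath)) {
                    return true;
                }
            }
        }
        onPath.remove(node);
        finished.add(node);
        return false;
    }
}
```

Note this only checks for loops along the edge directions; being a tree also requires the graph to be connected, which this sketch does not test.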

Weight is another concept we come across in graph theory and graph databases. Think of weight as being how important or strong a relationship is between two nodes. For example, a higher weight can imply a stronger relationship between two nodes representing, say, two friends.

This is an important concept as we can use it to determine a shortest path or least costly path between two nodes. (See an explanation of paths further below.)
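As a rough illustration of a "least costly path", here is a sketch of Dijkstra's algorithm, a standard technique that I'm swapping in for the example. It assumes each weight is a non-negative cost where lower is better; if your weights mean "strength" (higher is better), they would need to be converted into costs first. LeastCost and from are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class LeastCost {
    // graph.get(a).get(b) is the non-negative cost of the edge a -> b.
    // Returns the cheapest known total cost from source to each reachable node.
    static Map<String, Double> from(String source, Map<String, Map<String, Double>> graph) {
        Map<String, Double> best = new HashMap<String, Double>();
        PriorityQueue<SimpleEntry<String, Double>> queue =
                new PriorityQueue<SimpleEntry<String, Double>>(
                        (x, y) -> Double.compare(x.getValue(), y.getValue()));
        best.put(source, 0.0);
        queue.add(new SimpleEntry<String, Double>(source, 0.0));
        while (!queue.isEmpty()) {
            SimpleEntry<String, Double> current = queue.poll();
            String node = current.getKey();
            double cost = current.getValue();
            if (cost > best.get(node)) {
                continue; // stale entry: a cheaper route was found already
            }
            Map<String, Double> edges = graph.get(node);
            if (edges == null) {
                continue; // node has no outgoing edges
            }
            for (Map.Entry<String, Double> edge : edges.entrySet()) {
                double candidate = cost + edge.getValue();
                Double known = best.get(edge.getKey());
                if (known == null || candidate < known) {
                    best.put(edge.getKey(), candidate);
                    queue.add(new SimpleEntry<String, Double>(edge.getKey(), candidate));
                }
            }
        }
        return best;
    }
}
```

Nodes that cannot be reached from the source simply never appear in the returned map.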

One slightly more complicated idea is that of degree, which represents the number of edges attached to a node. So, a node with a high degree will be one that has a lot of relationships attached to it. This can be very useful for determining just how popular, for example, something or someone is. (Cue jokes for this blog.)
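Counting degree is about as simple as graph code gets. A minimal sketch, assuming undirected edges given as pairs of node names (Degrees and of are names I've made up):

```java
import java.util.HashMap;
import java.util.Map;

final class Degrees {
    // Each edge is a two-element array {a, b} of node names; edges are undirected.
    static Map<String, Integer> of(String[][] edges) {
        Map<String, Integer> degree = new HashMap<String, Integer>();
        for (String[] edge : edges) {
            // An edge adds one to the degree of both of its endpoints.
            degree.merge(edge[0], 1, Integer::sum);
            degree.merge(edge[1], 1, Integer::sum);
        }
        return degree;
    }
}
```

The node with the largest count is the "popular" one. Nodes with no edges at all won't show up in the map, since they never appear in the edge list.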

One last concept to get out there before wrapping this article up is that of a walk or traversal. A traversal is taken to be the path found by starting at some node X and following edges until reaching some other node Y, together with the edges and nodes visited during the trip from X to Y. This concept becomes important as we start to look into graph-based databases, as traversals are part of what we can use to extract useful information from a graph (e.g. "Who are all of my cousins?") and even to predict and recommend a product for someone based on their browsing history.
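To give a flavour of what a traversal looks like in code, here is a minimal breadth-first search sketch that finds a route with the fewest edges from X to Y. It's one of several ways to traverse a graph, and Traversal, path and the map-of-lists representation are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

final class Traversal {
    // Returns the nodes on a fewest-edges route from start to goal,
    // or an empty list if goal cannot be reached.
    static List<String> path(String start, String goal, Map<String, List<String>> graph) {
        Map<String, String> cameFrom = new HashMap<String, String>();
        Deque<String> queue = new ArrayDeque<String>();
        cameFrom.put(start, null); // the start has no predecessor
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(goal)) {
                // Walk the predecessor links backwards to rebuild the route.
                LinkedList<String> route = new LinkedList<String>();
                for (String step = goal; step != null; step = cameFrom.get(step)) {
                    route.addFirst(step);
                }
                return route;
            }
            List<String> next = graph.get(node);
            if (next == null) {
                continue;
            }
            for (String neighbour : next) {
                if (!cameFrom.containsKey(neighbour)) {
                    cameFrom.put(neighbour, node);
                    queue.add(neighbour);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

The cameFrom map doubles as the "visited" set, which is what keeps the search from going around in circles on graphs that contain loops.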

Ok, I think that's enough math-like stuff for one post. The definitions given in this post don't even scratch the surface of what's out there in the world of graph theory. While I'm no serious graph theorist academically, I'm sure there are some standard texts out there. That said, as mentioned earlier, Wikipedia is really a great resource for exploring the world of graph theory a bit more.

I'll introduce other graph-related concepts as they become needed (e.g. searching or path-finding); however, if there are requests for discussing specific graphing concepts, I'd be happy to address them (provided I can speak intelligently to them, of course).

Now go weird your friends out with jokes that their family tree isn't really a tree as it has loops!

The Mad Grapher

Monday, 11 November 2013

Paths, Entities, and Types in Spring Data for Neo4j (also, I'm back)