Showing posts with label neo4j. Show all posts

Monday, 11 November 2013

Paths, Entities, and Types in Spring Data for Neo4j (also, I'm back)

I know it's been a while since my last post, but I assure you--one and all--that I have not left you kind folks for good.  I can also assure you that the cause of my literary absence has everything to do with being loaded down with work and is not a result of my gallivanting around the world whilst trying strange and wonderful new beers.

(Boy, do I ever wish that was the case.)


Strange AND wonderful!
[Courtesy: FOX]
No, I've managed to find some time in which to write a bit about something I've been looking into concerning SDN: Paths with multiple entity types.

So, without further ado, let's begin.


Yes, quite.
[Courtesy: samuelrunge.com and Monty Python]

Disclaimer

While the topics I'm about to go into could very well have solutions to them that I have yet to uncover, I'm presenting my findings and questions in the hopes that not only will someone find the discussion interesting/useful, but, that it might also help lead me to better solutions.

So, please do read on and try to keep the caveat above in mind.

Retrieving Heterogeneous Paths

If you'll recall from several posts ago, I had been attempting to write a web application based around the concept of a simple recommendation engine and a software retailer (a la Babbage's from the last century).  The implementation was being done using a Spring-based Java stack with Neo4j as the data store.

I had gotten to a point where I was able to load data into Neo4j via a method not unlike the one from Cineasts' (which can be found as part of the Spring Data Neo4j Reference).  A quick recap of the domain model follows below.

The domain model in question.

One important task of any recommendation engine is suggesting entities that are relevant to a given starting point via some sequence of relationships.  Implementing such an engine, while quite doable, is something that I will eventually get to.

Another, simpler, task is to be able to see how two entities are related, i.e. "how do I get from point A to point B?".  As an example, one such question might be, "How is game A related to game B, even though the difference in publication dates is large?"  And yes, this type of question is extremely similar to the "Six Degrees of Separation" question.


I bet he's a gamer.
[Courtesy: Wikipedia]

Some Basic Code

You might also recall that I had implemented a GameRepository class with the following signature:



 
public interface GameRepository extends GraphRepository<Game>, RelationshipOperationsRepository<Game>

With the above repository extending the GraphRepository and RelationshipOperationsRepository interfaces, we are provided with a host of cool (and handy) methods out of the box (tip: make sure you're comfortable with the "convention over configuration" paradigm, as there is a bit of magic that goes on with those OOTB methods).

As you'd expect, we can also add additional methods to the interface.  One example of a method you might want to add could be a custom Cypher query that returns all Game nodes with a particular property (the implementation of this is outside the scope of this post, but, it's actually pretty simple; if people want to see it, just shoot me a note!).
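For the curious, such a method might look something like this (a sketch only--the query, property name, and package here are made up for illustration, and I haven't tested this exact snippet):

```java
public interface GameRepository extends GraphRepository<Game>, RelationshipOperationsRepository<Game> {
    // Hypothetical finder: all Game nodes whose "title" property matches the parameter.
    @Query("START n=node:__types__(className='com.yourorg.yourproject.entities.Game') " +
           "WHERE n.title = {0} RETURN n")
    Iterable<Game> findGamesByTitle(String title);
}
```

SDN binds the positional {0} parameter to the method's first argument, so callers just pass the title in.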

However, today we're looking to address the "Six Degrees of Separation" question (minus the limit on the degrees of separation), i.e. "how are node A and node B related"?

So let's give this a (very simple) shot:



 
@Query("START n=node(1), x=node(97) MATCH p = shortestPath( n-[*]-x ) RETURN p")
Iterable<EntityPath<Game, Game>> getPath();

Given that the nodes with IDs 1 and 97 are "Game" nodes, the Cypher query above is essentially determining how the two nodes are related.

(For the sake of this post, I'm ignoring the fact that there could be multiple "shortest paths" between the two nodes as it has little bearing on the goal of this post.)

Quickly going over the return type: SDN lets us return an EntityPath parameterized by the types of the starting and ending nodes, which in this case are both Game.  An EntityPath holds the nodes and their relationships that make up a Cypher path.  Since the query can match more than one path, the result is wrapped in an Iterable (you could use EndResult here instead, if you prefer).

We can then access the individual path(s) via the Iterable return type.
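In case it helps, iterating what comes back looks roughly like this (an untested sketch, and note that the bug mentioned in a moment means the nodeEntities() call may currently throw):

```java
for (EntityPath<Game, Game> path : gameRepository.getPath()) {
    System.out.println("Hops: " + path.length());

    // each entity along the path could be a Game, a Developer, a Customer, etc.
    for (Object entity : path.nodeEntities()) {
        System.out.println(entity.getClass().getSimpleName());
    }
}
```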

(NOTE: There is currently a bug with SDN that throws an exception when calling a single path's .nodes() or .nodeEntities().  This bug has been around since SDN 2.1.RC4.)


Traversing the Returned Path

There's a reason my explanation of the code stops where it does.  Those of you with a keen eye and/or familiarity with OOP/OOD will identify a potentially big stumbling block: How do you iterate through a path of nodes and/or relationships with potentially wholly disparate, unrelated types?  Given that SDN is geared towards integrating easily with Java and POJOs, the issue becomes apparent.

How do we solve this?

Solution 1

Make sure the nodes you are after all either implement a common interface or are derived from a common class.

Seem too good to be true?  You're right, because it is.  While there may be some scenarios in which this solution might work, it is often the case in a graph database that the nodes/relationships are entirely unrelated concepts/types, e.g. Game and Customer, or Game and Developer.  This separation implies that there are likely methods and attributes specific to a given type that would have no business being in a shared superclass or interface.

Solution 2

Employ some form of reflection.

There it is: The "r" word.  Reflection is generally quite costly, which immediately handicaps its appeal.

A "poor man's" reflection might be to implement a series of "if/else" blocks to check types and perform some appropriate casting.  I think we can see that this could and would become very ugly and difficult to maintain.
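To make that concrete, here's a tiny, self-contained illustration of the "poor man's" approach (the Game and Developer classes are just stand-ins, not my actual SDN entities):

```java
// Hypothetical stand-ins for two unrelated node entity types.
class Game { String title() { return "Space Weasel 3.5"; } }
class Developer { String name() { return "Edgar"; } }

public class PathDispatch {
    // "Poor man's reflection": explicitly check each possible type and cast.
    static String describe(Object nodeEntity) {
        if (nodeEntity instanceof Game) {
            return "Game: " + ((Game) nodeEntity).title();
        } else if (nodeEntity instanceof Developer) {
            return "Developer: " + ((Developer) nodeEntity).name();
        }
        return "Unknown node type";
    }

    public static void main(String[] args) {
        System.out.println(describe(new Game()));
        System.out.println(describe(new Developer()));
    }
}
```

Every new entity type means another branch here, which is exactly why it gets ugly fast.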

What about full-on automated reflection?  Well, we run into a bit of a snag with that, too: In a typical assignment operation, we have a left-hand side (LHS) and a right-hand side (RHS), e.g. TheClass theClassInstance = new TheClass();.  The RHS of the assignment is fairly straightforward with reflection.  Since SDN persists a __type__ attribute/property into a given node/relationship, we can fetch it and use it for conversion (since it's typically a fully-qualified type name).  It might look something like this:


// n = Node object in question
String theType = (String) n.getProperty("__type__", "");

// the RHS is then doable: resolve the class by name and convert, e.g.
//   Class<?> clazz = Class.forName(theType);
//   (LHS) = template.convert(n, clazz);   // ...but what type do we declare for the LHS?


But what about the LHS?  Without an "if/else" block, how can we treat the returned string theType as a first-class citizen that would declare the type for the LHS?  As far as I'm aware, there is no way to do this (of course there might be ways I've just not seen, but, I have a feeling they'd be just as expensive as the rest of the assignment).  Java is a strongly-typed language, and so I'm sure most of us would expect this outcome.

So we see that this "solution" isn't really much of one.
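One small consolation prize I'll mention: you can at least corral the ugliness into a handler registry keyed by the __type__ string, so the traversal code itself stays clean.  The casting doesn't go away--it just hides inside each handler--so this is containment, not a cure.  A plain-Java sketch (all the names here are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for two unrelated node entity types.
class Game { String title = "Space Weasel 3.5"; }
class Customer { String firstName = "Edgar"; }

// One handler per concrete entity type.
interface NodeHandler { String handle(Object entity); }

public class TypeRegistry {
    private final Map<String, NodeHandler> handlers = new HashMap<String, NodeHandler>();

    void register(String typeName, NodeHandler handler) { handlers.put(typeName, handler); }

    // Dispatch on the (fully-qualified) name we'd read from the __type__ property.
    String dispatch(String typeName, Object entity) {
        NodeHandler h = handlers.get(typeName);
        return (h != null) ? h.handle(entity) : "unhandled: " + typeName;
    }

    public static void main(String[] args) {
        TypeRegistry registry = new TypeRegistry();
        registry.register("Game", new NodeHandler() {
            public String handle(Object e) { return "Game: " + ((Game) e).title; }
        });
        registry.register("Customer", new NodeHandler() {
            public String handle(Object e) { return "Customer: " + ((Customer) e).firstName; }
        });

        System.out.println(registry.dispatch("Game", new Game()));
        System.out.println(registry.dispatch("Customer", new Customer()));
    }
}
```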

Solution 3

Somehow modify the query to work with a @MapResult-annotated interface to deal with the results (note: At least as of SDN 2.3.1, @MapResult has been deprecated in favour of @QueryResult.  I haven't done too much with @QueryResult so your mileage may vary).  This obviously requires more knowledge ahead of time of the types of paths, nodes and relationships you're planning on returning which may limit the kinds of heterogeneous queries you can execute.

So What Else Can Be Done?

I recently attended GraphConnect 2013 in New York City where I had a chance to meet up with Michael Hunger (a name which should require little or no introduction in the graph database/Spring Data continuum).  We had a great conversation about the very subject of this post.

His overall insights into Spring Data and its purpose, merits and detractors were very helpful, especially from a conceptual standpoint.

The number one point to take away--and perhaps it's quite obvious but it's worth reiterating--is this: Spring Data is not a magic bullet.  Given the differences in concepts here (i.e. a strongly-typed, object-oriented language and a graph-based, schema-free data store), there are bound to be limitations.

Spring Data's strong point is ease of integration.  A typical use case for SDN is likely to be one where relatively few nodes/relationships need to be returned.  SDN is not necessarily meant to "explore" graphs.

In order to truly resolve such differences, it would seem to me to make more sense to either layer SDN on top of another layer of abstraction, or even to go direct to the Neo4j API.

Perhaps an even better approach would be to use a more dynamic language such as Groovy or JRuby: something much more loosely typed and flexible, yet still able to be integrated into the Java stack.

(Shameless plug: Check out Pacer, a powerful, JRuby-based graph traversal engine.)

Summary/Conclusion

In this post, we have seen that exploring subgraphs and paths with SDN is not as straightforward as we'd perhaps like; however, it is clear that SDN was not built to accomplish such features (at least not yet).

Spring Data for Neo4j's strong suit is ease of integration.  As should be evident from this post and others, it is easy and straightforward to get SDN into existing/legacy Java applications, and to quickly stand-up Java-based applications that rely more on "end results" than "exploration" per se.

Ok, folks; as always, I definitely welcome comments, feedback, and questions.  If you can think of a better way to approach this kind of problem space, or even if I have something wrong here, please do let me know and I'll be sure to make good use of such feedback.

Thanks for reading, and we'll see you on the next post!

Friday, 25 October 2013

Spring Data Neo4j 2.3.1.RELEASE and Neo4j 1.9.4 Upgrading

I'm back with a quick post (with more to come soon).

I was in the middle of upgrading my little test project to a newer version of Spring Data Neo4j and Neo4j itself when I came across a few little points that others might find useful (though it should be noted that Neo4j is set to release 2.0 very soon and is currently doing milestone releases).

I upgraded SDN to 2.3.1.RELEASE and Neo4j (all aspects of it, including Cypher) to 1.9.4.

Here are a couple "gotchas" I encountered:

Dependencies

It would seem that CGLIB has been moved out of one of the Neo4j or SDN dependencies; however, I also found that--with the SDN/Neo4j combination I'm using--a specific version is required, namely 2.2.2.

Adding this bit into my POM fixed things up nicely:

<dependency>
    <groupId>cglib</groupId>
    <artifactId>cglib</artifactId>
    <version>2.2.2</version>
</dependency>

No bean named 'graphDatabaseService' is defined

This one was fun.  As confirmed in this Neo4j forum thread (which actually uses Neo4j 1.7 and SDN 2.1.0.Build-Snapshot), when configuring Neo4j in an application context, the bean ID for the graph database service (whether it's embedded or from a server) must be "graphDatabaseService", similar to this:

<bean id="graphDatabaseService"
      class="org.springframework.data.neo4j.rest.SpringRestGraphDatabase">
    <constructor-arg index="0" value="http://someserver:7474/db/data" />
</bean>

If this little nuance is overlooked, you could very well see exceptions when, say, starting up your application server with your SDN-based application.

In my case, Maven compiled my WAR file just fine, but, starting up Tomcat produced a slew of exceptions.

It would seem that this is an ongoing issue, though, perhaps it's a necessary change from the SDN folks.  We'll just have to see!

Hopefully some people find these tidbits useful.

We'll see you on the next post!

Friday, 28 September 2012

Spring Data Neo4j, @MapResult, Cypher, Casing and You!

A quick tip for those of you who are using Cypher with Spring Data Neo4j (SDN):

If you're using the @MapResult way in your Neo4j repositories, be careful of what you use in the corresponding @ResultColumn annotations.

For example, let's say you have the following repository definition (assume that you don't want to use the built-in findAll() method in this case; this example can be extended to other mapping results WLOG):


 
public interface MyRepository extends GraphRepository<MyModel>, RelationshipOperationsRepository<MyModel> {
    @Query("START n=node:__types__(className='com.yourorg.yourproject.entities.MyModel') RETURN COLLECT(n)")
    MyModelData getAllMyModels();

    @MapResult
    public interface MyModelData
    {
        @ResultColumn("COLLECT(n)")
        Iterable<MyModel> getMyModels();
    }
}

What you'd expect getAllMyModels() to do is to simply return all of those nodes that meet the Cypher criteria (note that the Cypher query above is very similar to what is generated by SDN in its findAll() method; the index referenced is, in fact, created and persisted by SDN).

However, if you call this code, you will get an error similar to the following:

"org.springframework.data.neo4j.support.conversion.NoSuchColumnFoundException: Expexted a column named COLLECT(n) to be in the result set."

You're probably scratching your head and asking yourself at this point, "But I am returning 'COLLECT(n)'!  It's right there in the Cypher query!"

And you're completely right--it is there!

However, try running that same query in the Neo4j web console (wherever your graph database is residing).

Dig through the results, and you'll see that the column returned isn't, in fact, "COLLECT(n)", but, "collect(n)".

Yep, you got it!  It's case sensitive!

So if you change

@ResultColumn("COLLECT(n)")

...to...

@ResultColumn("collect(n)")

...(note that casing of "collect") you'll be good as gold.
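The reason this matters: the result set is essentially a map keyed by the column name exactly as Cypher reports it, and map lookups are case-sensitive.  A toy illustration of the failure mode:

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnLookup {
    public static void main(String[] args) {
        // Simulate a result row: Cypher reports the column in lower case, "collect(n)".
        Map<String, Object> row = new HashMap<String, Object>();
        row.put("collect(n)", "...the collected nodes...");

        System.out.println(row.containsKey("COLLECT(n)")); // false: NoSuchColumnFoundException territory
        System.out.println(row.containsKey("collect(n)")); // true: the annotation must match this
    }
}
```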

Remember that the next time you're diving deep into SDN.

And, as an update, I continue to work on the project I started back in June, and am finally making some headway.  I'm coming across some interesting stuff, and I hope to share more in the coming weeks.

We'll see you on the next post!

Wednesday, 27 June 2012

Persisting to Neo4j via Spring Data (or, "Aren't We Persistent?")

Hi gang!

Ok, I'm back with a new post, this time about a couple quirks I ran into while implementing some test cases for Spring Data using Neo4j.  They're not bugs by any stretch; it's just new behaviour to get used to as you venture into the Spring Data world (which I'm loving, by the way).

During my copious amounts of downtime (that should be read while imagining me rolling my eyes so hard that I fall over backwards in my chair), I've been putting together a little playground for me to mess around with and play with Spring Data.

It's definitely evolving and changing as I change things up and try out new ideas, and I fully plan on sharing more about this in future posts.

For now, though, I'm just discussing a couple potential pitfalls newcomers to Spring Data might fall prey to (please pardon the prepositional phrase; I'm sure it won't be the last one).

Background
I've wanted to play with Spring Data a bit more seriously for some time now, so I started a few weeks ago and, I have to say, I'm loving every second of it.  (I'm already a huge Spring fan, and the annotations continue to make my life easier.)

My sandbox goes something like this: After having gone through the docs for Spring Data (especially "Good Relationships"), I thought I'd try out something similar for myself, borrowing the whole "store" concept (as it seems to me to be the best, first choice for implementing a graph database).

Instead of using the whole "movie store" concept, I switched to something a little different to avoid total code reuse (I do borrow some code from the link above but modify it an awful lot).

For any geeks around my age (or older), you will remember a certain computer software retailer called Babbage's.  I have fond memories of begging my parents to go into the store every time we passed one, which wasn't often (at least I don't think it was...).  GameStop Corporation went on to purchase Babbage's (and EB Games, and a bunch of other software retailers), so you're unlikely to see a Babbage's by that name.

With that short trip down memory lane finished (more like memory cul de sac), I decided to model my domain after the concept of a software retailer.  My store will cleverly enough be called Von Neumann's (any CS major and most geeks out there are currently groaning at that joke).

Domain Model
Currently, the domain model consists of the following:



Even looking at the UML diagram above, we can see that it's based on a graph model (can you pick out the entities and/or the relationship(s)?).

Test Cases' Setup
After setting up the relevant project (which I did as a Maven project), I set forth creating some tests using JUnit.  Before creating the actual test cases, I needed to make sure I had my testing context set up.  I also needed a way to ensure that any data being persisted was wiped clean after each run.

Fortunately, instead of having to create such functionality for my project, I learned that Neo4j already has a handy solution!  The ImpermanentGraphDatabase.  This little gem can be found in the Neo4j kernel.  Specifically, I added these lines to my POM (you can see the specific version I'm using, too):


<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-kernel</artifactId>
    <version>1.8.M03</version>
</dependency>


...and then adding the following line to my testing context:

<bean id="graphDBService" class="org.neo4j.test.ImpermanentGraphDatabase" destroy-method="shutdown"/>

And presto!  A suitable testing graph database for my test cases!

(Warning: I am using the latest version that I found worked best for me and is compatible with all my other dependencies.  ImpermanentGraphDatabase is available in earlier versions of Neo4j, as well.)

It should also be noted that I make use of both the Neo4j repository interfaces AND the Neo4jOperations interface for persisting and retrieval.

Another note is that I've made the entire test class @Transactional.

Test Cases Proper
I'll list two of them below and a couple of the quirks I noticed.

Ensuring a Customer Can Make a Purchase
This test case consists of creating a Customer object, a couple Game objects, and making sure that Purchases can be created, persisted and retrieved (along with the associated entities). 


 
 @Test
 public void customerCanMakePurchases()
 {
  // setup our game constants
  final int QTY = 1;
  final String GAME_TITLE = "Space Weasel 3.5";
  final String GAME_TITLE_2 = "The Space Testing Game";
  final String GAME_DESC = "Rodent fun in space!";
  final String GAME_DESC_2 = "Tests in space!";
  final int STOCK_QTY = 10;
  final float PRICE = 59.99f;
  
  // setup our customer constants
  final String FIRST_NAME = "Edgar";
  final String LAST_NAME = "Neubauer";
  
  // create our customer for this test
  Customer customer1 = new Customer();
  
  // set the customer's properties (NOTE: "firstName" is an indexed property in the Customer entity, but "lastName" is not!)
  customer1.setFirstName(FIRST_NAME);
  customer1.setLastName(LAST_NAME);

  // create our games for this test
  Stock game1 = new Game(GAME_TITLE, GAME_DESC, STOCK_QTY, PRICE);
  Stock game2 = new Game(GAME_TITLE_2, GAME_DESC_2, STOCK_QTY + 5, PRICE + 5);

First, do the setup.  (And, for the sake of brevity, I'm leaving out the annotated entities.)

Nothing strange going on here--just creating two games and a single customer.  It is worth noting (for later on) that "firstName" is an indexed property of the Customer entity/node.  This means that it is searchable (also recall that Neo4j's default indexing engine is Lucene).


It had to be done.
The games we've chosen are clearly AAA-title games.  These tests should be interesting.


  // save entities BEFORE saving the relationships!
  template.save(game1);
  template.save(game2);
  template.save(customer1);

  // make those purchases! Support our test economy!
  // (NOTE: "makePurchase" actually uses the "template" parameter to persist the relationship, so no need to do it again)
  Purchase p1 = customer1.makePurchase(template, game1, QTY);  
  Purchase p2 = customer1.makePurchase(template, game2, QTY);

Above, we make sure to persist the 2 games and single customer.  We also do this prior to persisting any relationships.  This is necessary.  In this case, I make use of an instance variable called "template" which is actually an instance of Neo4jOperations.  This is one way of accessing the necessary persistence/retrieval functionality we need.

We then create 2 Purchase objects/relationships (Purchase is actually a relationship entity).  It is also worth noting that, instead of using the "template" object to persist the relationships, I've followed the Neo4j tutorial book "Good Relationships" and attempted another way of doing persistence, i.e. by passing the "template" object into the necessary method and having the method (in this case makePurchase) actually do the persisting of the newly-created Purchase.

Again, both "game1" and "game2" need to be persisted prior to persisting any relationships between them.

Still with me?


  // retrieve the customer
  Customer customer1Found = this.customerRepository.findByPropertyValue("firstName", FIRST_NAME);
  
  //
  // Tests
  //
  
  // can we find/retrieve the customer?
  assertNotNull("Unable to find customer.", customer1Found);
  
  // can we find the specific customer for which we are looking?
  assertEquals("Returned customer but not the one searched for.", FIRST_NAME, customer1Found.getFirstName());
  
  // does the retrieved customer have its non-indexed properties returned, as well?
  assertEquals("Returned customer doesn't have non-indexed properties returned.", LAST_NAME, customer1Found.getLastName());

  // retrieve the customer's purchases
  // (NOTE: We convert to a Collection just to make checking the number of purchases easier)
  Iterable<Purchase> purchasesIt = customer1Found.getPurchases();
  Collection<Purchase> purchases = IteratorUtil.asCollection(purchasesIt);
  
  // do we have the correct number of purchases?
  assertEquals("Number of purchases do not match.", 2, purchases.size());

So now we get to some actual testing.

The tests above are all straightforward.  We ensure the following:
  1. We can retrieve a persisted node, specifically via an indexed property.
  2. We can retrieve the correct persisted node for which we are searching.
  3. We can view non-indexed properties from the retrieved node.
  4. We can retrieve the correct number of relationships of the retrieved node.
As noted in Section 9.3 of "Good Relationships", we use Iterable for those node properties that are collections and are to be left as read-only, and Collection or Set for those collections that can be modified.
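If you're wondering why Iterable signals "read-only", a plain-Java illustration may help: the Iterable view exposes no mutators, so the compiler stops callers from (accidentally) modifying the relationship collection (the Customer class here is a stand-in, not my actual entity):

```java
import java.util.HashSet;
import java.util.Set;

public class ReadOnlyView {
    // Hypothetical entity holding its relationships internally.
    static class Customer {
        private final Set<String> friends = new HashSet<String>();

        void addFriend(String name) { friends.add(name); }

        // The Iterable return type offers iteration only -- no add/remove methods.
        Iterable<String> getFriends() { return friends; }
    }

    public static void main(String[] args) {
        Customer c = new Customer();
        c.addFriend("Edgar");

        int count = 0;
        for (String friend : c.getFriends()) { count++; } // reading is fine
        System.out.println(count);
    }
}
```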



  // go through the actual purchases...
  Iterator<Purchase> purchIt = purchasesIt.iterator();
  Purchase purchase1 = purchIt.next();
  
  // retrieving objects via Spring Data pulls lazily by default; for eager mapping, use @Fetch (but be forewarned!)
  // ...this means we have to use the fetch() method to finish loading related objects
  Stock s1 = template.fetch(purchase1.getItem());

What if we want to view a node's related nodes' data?


By default, Spring Data loads an entity's relationships lazily, which makes perfect sense (just picture how much memory would be needed if you had a very large, highly connected graph).  Also remember that there are implicit relationships between entities if an entity is contained as a property of another entity.


(Courtesy of Paramount Pictures' Forrest Gump)
"Mama said eager loading is like a box of chocolates: You never know what you're gonna get."
Well, at least the chocolates had an easily-determined, finite number in the box...

It is possible to have an eager retrieval by using the @Fetch annotation (be warned, though, that it will currently only work, by default, on node entities and collections of relationships that are based on Collection, Set, or Iterable; Spring Data may expand that in later releases, but I believe you can extend the mappings to work with other classes, if you so desire).

So, with our lazily-loaded relationships, we can use "template"'s fetch method to finish loading in the missing data.  It's as simple as that!  Anyone familiar with ORM will get this immediately.
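If lazy loading is a new concept to you, the underlying idea is just deferred computation--nothing is resolved until someone actually asks for it.  A bare-bones, plain-Java sketch (SDN's machinery is, of course, far more involved):

```java
// A bare-bones lazy reference: the "load" only happens on first access.
public class LazyRef {
    interface Loader { Object load(); }

    private final Loader loader;
    private Object value;
    private boolean loaded = false;

    LazyRef(Loader loader) { this.loader = loader; }

    Object get() {
        if (!loaded) {            // first access triggers the (expensive) load
            value = loader.load();
            loaded = true;
        }
        return value;             // subsequent accesses reuse the cached value
    }

    public static void main(String[] args) {
        LazyRef item = new LazyRef(new Loader() {
            public Object load() { return "Space Weasel 3.5"; } // imagine a DB hit here
        });
        System.out.println(item.get());
    }
}
```

template.fetch plays roughly the role of get() here: it forces the deferred load to actually happen.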


  // can we retrieve our first purchase successfully w/ its details?
  assertEquals("Purchased item not persisted properly.", GAME_TITLE, s1.getTitle());

  purchase1 = purchIt.next();  
  Stock s2 = template.fetch(purchase1.getItem());
  
  // can we retrieve our second purchase successfully w/ its details?
  assertEquals("Purchased item not persisted properly.", GAME_TITLE_2, s2.getTitle());
  
  // if we're here, then all tests ran successfully.  Hooray!
 }

Above, we run a couple more tests to ensure that we can, in fact, retrieve and view lazily-loaded objects from Neo4j.

Nothing to it!

Making Friends the Easy Way: By Creating Them!
For these tests, we're going to have a look at something a bit more social, i.e. customers befriending other customers (how Utopian!).  I suppose we could make them "rivals" or "enemies", but that's a bit too sinister for this blog (for now...).

Anyway, prior to this test method I have a setup method (annotated with the @Before JUnit annotation) that creates 5 customers (if you're interested, I persist them using a CustomerRepository I created by extending the GraphRepository and RelationshipOperationsRepository interfaces).


 @Test
 public void customerFriends()
 {
  // add friends
  c1.addFriend(c2);
  c1.addFriend(c3);
  c1.addFriend(c4);
  c1.addFriend(c5);

  // be careful! setting a "Direction.BOTH" relationship in one node entity will have the ENTIRE relationship saved (*including the adjoining node*) when saving just ONE of the two entities!
  // ...if you save both, Neo4j will remove the duplication (and you'll be left wondering why c1 is a friend of c2, but not vice versa)
  
  // save c1's friends
  customerRepository.save(c1);

In the code above, we have the customer "c1" make friends with the other customers (he's a social butterfly).

Now, perhaps the most important part of this whole blog is shown here (and below).  It has to do with relationships, specifically those that are annotated as being "Direction.BOTH".

As you can see from the comments in the code above, we need to be careful about how we create relationships between nodes and save them.  If we were to create the relationship between, say "c1" and "c2", and then persist each node (and therefore the relationships, which the customer repository will handle), we would notice that the relationships have gone awry, and that the duplicate relationship from "c2" has been removed.

So, what we're going to do is the following (keeping in mind that "c1" and "c2" have already been persisted in the setup method):

  1. Persist "c1" (and thereby its friendship to "c2").
  2. Retrieve "c2".
  3. Add any other friends' relationships to the retrieved "c2" (while not befriending back to "c1").
  4. Persist "c2" (and thereby its friendships to those added in Step 3).

Step 1 is done above.


  // we can't just continue to add friends to this.c2, as once we try to save this.c2, it'll remove the duplicate relationship between c1 and c2.
  // ...so, to get around this, we retrieve the persisted object from the DB
  Customer c2Found = customerRepository.findByPropertyValue("lastName", C2_LNAME);
  c2Found.addFriend(c3);
  c2Found.addFriend(c4);
  c2Found.addFriend(c5);

  // save c2's friends, which will preserve the existing relationship with c1! Old friends can remain friends!
  customerRepository.save(c2Found);

As you can see above, we finish the remaining steps (2 through 4).

Again, note that we DO NOT create a reciprocal relationship from "c2" to "c1".  (One would hope a friendship relationship would be reciprocal; unless you have stalkers or something...)


This would totally help.

All that's left now is to run some tests to ensure that our friends have remained friends throughout all this persisting!


  // retrieve c1 for some tests
  Customer c1Found = customerRepository.findByPropertyValue("lastName", C1_LNAME);
  
  Iterable<Customer> c1Friends = c1Found.getFriends();
  Collection<Customer> c1FriendsSet = IteratorUtil.asCollection(c1Friends);
  Iterator<Customer> custIt = c1Friends.iterator();
  
  int numFriends = 0;
  
  // let's make sure all of c1's friends were retrieved
  assertTrue("Friend not found.", c1FriendsSet.containsAll(IteratorUtil.asCollection(c1.getFriends())));
  
  // let's also make sure that c1 and c2 are still buds specifically (these two are inseparable...you should see them at ComicCon!)
  assertTrue("Friend not found.", c1FriendsSet.contains(c2));
  
  // let's make sure the exact number of friends returned is correct 
  while (custIt.hasNext())
  {   
   custIt.next();
   numFriends++;
  } // while
  
  assertEquals("Number of friends returned incorrect.", 4, numFriends);

  // if we're here, all is well! Huzzah!
}

Above, as in the first test, we make sure that all of the friendships have been properly preserved, both from "c1"'s and "c2"'s perspective.

Conclusion
In this post, we have seen the basics of persisting with Spring Data and a couple of the quirks I ran into.  These are documented within the Spring Data documentation, but it never hurts to bring these little nuances out into the light even further.

We also saw that the ImpermanentGraphDatabase is available to us through the Neo4j kernel, which is a wonderful tool for implementing test cases with quick setup and teardown--no need to write initializers and cleaners for a Neo4j installation!

So there we have it!  A first pass through persisting with Spring Data and implementing some unit tests using Neo4j and JUnit.

If anyone has any questions or I've made a mistake, please feel free to leave feedback.

We'll see you on the next post!

Monday, 27 February 2012

Fun with Gremlin (not related to the movie or the car)

So you have all of this graph-tacular data in your graph database (for this post, I'm using neo4j).  It looks slick with its vertices and edges.  People stop by your desk to say, "Is that the new connect-the-dots app you're working on?"

After staring at them for a couple moments and repressing the urge to sell them the "app" for five bucks, you start thinking about how you're going to access and use this wonderful data.

"If only there was a way to query the data..." you wonder.

While SQL is the querying language of choice for relational databases, there is no real "standard" as far as NoSQL databases go (that's a subject for another time).

In the world of graph-driven databases, there are options:
  • Gremlin: a Groovy-based querying language that can handle any type of query.  This language is perhaps geared more towards those with a math- or graph-based background, as the syntax is nothing like SQL.
  • SPARQL: a popular query language for RDF graphs.  This language is likely more easily picked up by those with an SQL background, as the syntax is more SQL-like than Gremlin's.
  • Cypher: a neo4j-specific querying language.  This language currently only allows read-only queries of graphs (i.e. no inserts, updates or deletions).  Like SPARQL, Cypher also takes its cues from SQL.
Neo4j folks: At the time of this writing, neo4j comes pre-loaded with plugins for both Gremlin and Cypher.  (As I understand things, there is currently a ticket open in the neo4j community to develop a SPARQL plugin, but it has not yet been completed; there are likely complications stemming from the fact that neo4j is not an RDF database at its core.)

In this article, I'm going to cover some basics in using Gremlin.  By using some concrete examples, I hope to demonstrate a bit of the power behind using a graph-based database!

For the purposes of this post, I'll be making use of a small graph I created that contains some people, who they know, and what they've purchased.  I expect this graph to grow and change as time goes on, but, that's where it stands for now (this is the beauty of schema-less data).

So far, Gremlin is the only querying language I've used with graph databases, hence this article making use of Gremlin.  The good thing about this Gremlin is that it won't require a new muffler and you don't have to worry about feeding it after midnight.

Getting started with Gremlin and Neo4j is easy enough.  It's a plugin that comes with Neo4j, so all you need to do to begin is to open up your Neo4j web admin instance, click the "Console" menu option at the top, and select the "Gremlin" option from the top-right of the console that appears.

At this point, you're faced with the currently-available variables and a gremlin overlooking them, like so:



We see that the variable g contains the current graph.  If we issue the query "g.V" (without the quotes, as always), we get a list of all the vertices (nodes) in the graph; however, this information is not incredibly useful as you're only given each node's ID.

Let's say some (but not all) of our nodes have been given the property "Name".  If we try using "g.V.Name", we'll again see a listing of the nodes; however, the value of the "Name" property for each node (if available) will appear (if "Name" isn't a property of a specific node, "null" will appear; also, note that Gremlin is case-sensitive).


We can also similarly view a list of the edges (relationships) by issuing "g.E" to the console; this time, however, in addition to the edge IDs, we also see the type of relationship and the adjoining vertices (nodes), e.g. 1-KNOWS->6.  Note that you can see that these edges are directed!  In this case, we see that the edge goes out from node 1 and goes in to node 6.  Useful stuff.

We can also (similarly) view a property of the edges (if it exists) as we did for the vertices, e.g. "g.E.Quantity".

Identifying individual vertices and edges is simply a matter of knowing each one's ID number.  Obtaining a reference to a node (which, yes, can be assigned to variables for easier reference) involves a call like "g.v(6)" or "g.e(3)" (note the casing).

You can examine the value of an individual node's/relationship's specific property by querying it much like we did above, e.g. "g.v(6).Name".

Want to know all of the edges coming out of a node?  "g.v(6).outE" will do the trick.  Similarly, if you want to know all of the edges coming in to a node, we can use "g.v(6).inE".

We can also go one step further and use the "inV" and "outV" steps to identify the nodes at the ends of an edge.  "inV" will correspond to the node at the head of an edge (also known as the "incoming vertex"), whereas "outV" will correspond to the node at the other end of an edge (an "outgoing vertex").

You can also use "bothV" and "bothE" to get both incoming and outgoing vertices and edges (respectively).

So if you want to travel from one node to another, you might do something like: g.v(1).outE.inV.  You can shorten this by using: g.v(1).out.  There exist similar constructs for "in" and "both". (Note that "in" is a short-cut for "inE.outV".)
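To make the equivalence concrete, here's what this looks like in the console (the node ID comes from my little sample graph, so treat it as illustrative):

```groovy
// the long way: follow outgoing edges, then hop to the vertex at each edge's head
g.v(1).outE.inV.Name

// the short way: same traversal, same results
g.v(1).out.Name
```

Both pipes return the names of every node that node 1 points at.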

Have more than one type of relationship connected to a node?  No problem!  You can traverse specific ones via something like this: g.v(1).out('LIKES') (this will take you to the nodes on the other end of node 1's outgoing 'LIKES' relationships).

(Gremlin's github page has a good basic tutorial about all this.)

I could go on at great length about the features of Gremlin, but I think this is a great starting point.  I'm going to include some concrete examples below.  If I use any constructs or syntax that doesn't make sense, I very much encourage you to visit the Gremlin wiki page to look up the answers; this is a good little exercise, especially for the "groupCount" and "cap" constructs.

The examples below assume a graph that has nodes describing products and people, and relationships showing purchases and who knows whom.

How many times has each product been purchased?
g.V.inE('PURCHASED').inV.ProductType.groupCount.cap

Which products have been purchased more than once?
g.V.filter{it.inE('PURCHASED').count() > 1}.ProductType

Who is known by more than one person in this graph (we define 'knowing' as sharing an edge/relationship--in or out--with another person)?
g.V.filter{it.bothE('KNOWS').count() > 1}.Name

Who knows the most people, and how many people do they know?
g.V.outE('KNOWS').outV.Name.groupCount.cap

How well-known is each person?
g.V.both('KNOWS').Name.groupCount.cap
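And by chaining a few of these steps together, we get a first taste of the recommendation-style queries graph databases are famous for (again, the node ID is illustrative; substitute a real person's ID from your own graph):

```groovy
// What have the people that node 1 knows purchased?
g.v(1).both('KNOWS').out('PURCHASED').ProductType
```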

(By the way, if anyone notices anything wrong with anything above, please let me know as I'm always looking to evolve and develop my knowledge of, well, everything.)

So you begin to see the power of what we can extract out of a graph database!  Personally, I'm tempted to find out "who is your daddy and what does he do?"  Such a question would be relatively straightforward to figure out!



Ok, I think that's enough for now.  Hopefully this is a decent (but brief) introduction to the world of Gremlin and querying graph databases.  I know writing this has forced me to examine in more depth what exactly these queries actually do.

I'll see you next time!

Wednesday, 15 February 2012

On the Subject of NoSQL (and a bit about graph databases)

Pretty formal-sounding title, yeah?

(I'm likely just suffering from title-writers' block.)

So, before I dive head-long into an actual graph database, it's probably a good idea to briefly discuss what makes a graph database a graph database.

RDBMS

For the past 30+ years, the world of databases has been primarily dominated by the colossus that is relational databases (i.e. RDBMS, or Relational DataBase Management System).  RDBMSes are well-known and well-studied; suffice it to say that they can be used to model almost any kind of information (that is, a broad, general set of data).

I'm going to assume that the reader is familiar with basic RDBMS concepts like columns, rows and tables.  Information in an RDBMS is grouped into similar entities that can have relationships between them.  In this way, we can model just about any situation in this single kind of database.  For example, you can model everything from a school (teachers, classrooms, students, schedules, etc.) to an online business (products, orders, inventory, etc.).

Because each table consists of rows and each row is made up of columns (columns representing the types of information you want to store), we know what kind of data to expect in each table and database based on its schema.  This is great for ensuring that you don't try to save a product's name when it's expecting the product's price.

RDBMSes are also good for both looking up information as well as handling the storage of data.  The concept of transactions is important; just as businesses need to deal with transactions every day, so too does a database that is used to enter data of a transactional nature (e.g. payments, orders).

Another important feature of any RDBMS is the ability to query the data.  SQL (Structured Query Language) is almost as old as RDBMSes themselves and is a powerful tool for looking up and modifying data in an RDBMS.  It might not be the most efficient tool on its own given the potential complexities of how a specific database is laid out (see: query analyzers and optimizers), but with the right indices set up and the right queries, it can be tuned into a very powerful tool.  (SQL can also be prepared/compiled in some cases, but we won't get into that.)

So why use one of these so-called NoSQL databases?

NoSQL

Without going into too much detail about NoSQL (that's another post or ten on its own), NoSQL is really better off being called NoREL (i.e. non-relational model, though that's more in the traditional RDBMS sense).

RDBMSes are geared towards being a general solution for almost any model: they do everything fairly well, but require specialized tuning in order to run really well, and that tuning comes at the expense of some areas getting better while others get "worse" as a result.  For example, it's difficult to make a database that is tuned for efficiently handling a high volume of transactions also be efficient at speedy look-ups and reads.

NoSQL databases provide specializations that RDBMS systems can't.  They come out of the box ready to be super-good at one or two areas, but not so great in others.  Think of them as pre-tuned databases, and one size definitely does not fit all.

NoSQL databases range in types from document stores (e.g. MongoDB, CouchDB) to key-value stores (e.g. Redis) to graph databases (e.g. Neo4j), to name but a few types.  (A decent breakdown of NoSQL database types can be found here and here.)

You could likely tune an RDBMS to be quite good at a number of things, but it's a bit of a pain (trying to set up and tune proper indices is a painful and involved process).

This is partly why I disagree wholly with those who say NoSQL heralds the death of RDBMS.  Quite the contrary: I see NoSQL databases as being an excellent complement to RDBMS.

Another advantage of NoSQL databases is the fact that most of them are schema-less; that is, they can store arbitrary information without the need to structure it.  This can be very powerful when modelling heterogeneous information in a database.  It can also allow for the evolution of your data models without the need to completely overhaul and change your database (anyone who's tried to do that before with an in-production database knows just how painful that is).

NoSQL databases also tend to scale very, very well, often times easier than clustering together an RDBMS-based solution.

One point worth mentioning is the fact that RDBMSes are traditionally known as what is called ACID compliant.  ACID (which stands for Atomicity, Consistency, Isolation, Durability) is a very important concept for databases that handle transactional information (and most businesses do).  A discussion on ACID is well outside the scope of this post, but ACID compliance is commonly lacking in most NoSQL databases (most of them subscribe to the principle of Eventual Consistency).  This is definitely worth noting and keeping in mind when choosing a NoSQL database to use.  A great example of an exception to this is the fact that Neo4j (a graph database) is, in fact, ACID compliant.

(If anyone wants an actual discussion on ACID vs. Eventual Consistency and why it's important, let me know and I'll see about putting an article together.)

Graph Databases

As far as graph databases are concerned (took me long enough to get here), I strongly suggest going to this link and checking out the "What is a graph database?" and "Comparing Neo4j" tabs on the page.  They do a wonderful job of explaining what a graph database actually is (big surprise) and how graph databases relate to other types of databases (both NoSQL and RDBMS).

While you do typically need to index nodes and relationships for searching (e.g. full-text searches over properties), strictly speaking a graph database is one that provides "index-free adjacency" (source: http://en.wikipedia.org/wiki/Graph_database).  This means that each element (i.e. node) has a link to its adjacent elements to follow--no index look-ups are necessary.
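If you want a mental model of index-free adjacency, here's a minimal (plain Groovy, decidedly-not-Neo4j) sketch of the idea:

```groovy
// Each node holds direct references to its neighbours, so traversal is just
// pointer-chasing -- no index is consulted along the way.
class Node {
    String name
    List<Node> knows = []
}

def jason = new Node(name: 'Jason')
def scott = new Node(name: 'Scott')
jason.knows << scott

// "Who does Jason know?" -- follow the references directly
assert jason.knows*.name == ['Scott']
```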

Graph databases are a great way of representing graphs (remember those things with nodes and relationships?).  Graphs in a graph database typically consist of arbitrary nodes and relationships.  Each node and relationship can have assigned to it an arbitrary number of properties.  Properties are simply key-value pairs of information (e.g. "Name" = "Joe" and "Age" = 30).

So you can easily represent a family tree, a network diagram, or even a social network with a graph database.  For example, think of two nodes, each one representing a friend at work (call them Jason and Scott), and a relationship between them (representing their friendship).  So, each node would have properties like, "Name"="Jason", "Age"=30, etc.  The relationship between them could be labelled "KNOWS", and that relationship could have properties like "Since"="01/01/2001" and "At"="Acme Inc.".  All of a sudden, we now have a way to track friends, find out who knows whom, when they met, and where they met.
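In the Gremlin console that ships with Neo4j, building that little example might look something like this (Gremlin 1.x/Blueprints-style syntax; treat it as a sketch, since the exact method signatures vary between versions):

```groovy
// create the two friends and set their properties
jason = g.addVertex(null); jason.Name = 'Jason'; jason.Age = 30
scott = g.addVertex(null); scott.Name = 'Scott'

// connect them, and record when and where they met on the relationship itself
knows = g.addEdge(null, jason, scott, 'KNOWS')
knows.Since = '01/01/2001'
knows.At = 'Acme Inc.'
```

From there, jason.out('KNOWS').Name would tell you who Jason knows.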

Now let's say somewhere down the line we learn something else about each person; say, Jason's birthday.  It's very easy to add a new property to Jason's node.

We begin to see the power of graph databases very quickly.

We can use graph databases in many ways, including (but definitely not limited to):
  • Recommend products to buy based on a user's purchase history (follow a graph from a product someone has bought back through another user that's bought the same product and then on to another product that other user has also purchased).
  • Find out just how popular someone is (look at the number of relationships that person's node has).
  • See what geographic locations have the most users in it.
  • Find out if you know the CEO at a powerful company through a friend (you can always use more friends!).
This is why sites like Amazon and LinkedIn are so powerful.  Think about how they might use a graph database.

Ok, that's enough for now.  I keep thinking I can write short posts on this stuff, but, there's just so much!

As always, I'm sure there's plenty more I haven't covered, but if you'd like to see anything else put up here (or clarified), just let me know.  I'm always looking for ways to improve how I organize and present this information.

Peas!


Neo4j: First Blood

That's one serious-sounding title.

This post is about my first foray into the graph database world.  I chose as my first victim/offering Neo4j.  I'll likely end up writing more than a few articles about this particular database, but, I thought I'd start with the basics, including the following:
  • Download
  • Installation
  • Configuration
  • Poking around
(And yes, "poking around" is a sanctioned technical term.)

So, a bit about Neo4j to start!
  • It's been around since 2007.
  • NUMBER ONE SELLING POINT FOR ME: It's ACID compliant!  Not too many NoSQL engines that I've seen (yet) are, though, for various reasons that are well outside the scope of this post.
  • As you may have guessed, it's primarily meant for integration with Java.
  • It can also be integrated with Spring (a big plus, if you ask me) via Spring Data (POJO development FTW!).
  • Its API is REST-based, and so can be utilized by just about any platform (though you'll likely have to write your own wrapper, unless you can find one out there in the open source world).
  • There are some ready-made wrappers for some platforms available, such as Python and Ruby.
  • It's available for Windows, MacOS and Linux-based OSes.
  • It's available in both 32-bit and 64-bit for Windows and Linux.
  • It's available in 3 versions, including the Community version, the Advanced version, and the Enterprise version.  As you'd expect, the Community version is open source available under GPL (the other versions are covered under AGPL).
  • It comes ready-to-run with a version of the web/app server Jetty.
  • It comes with a built-in web admin console (hence the need for Jetty).
  • Following in the NoSQL tradition, it scales very well for Big Data.
  • Its name lends itself well to any number of The Matrix jokes.
I strongly suggest going to their website (www.neo4j.org) to do a little research of your own.

Download

Given that I'm just looking to get my feet wet with Neo4j, I downloaded the Neo4j v1.6 Community Edition 64-bit Linux package onto my ready-made VM (coincidentally named Morpheus) running CentOS (sorry Windows users).  Read: I can't be bothered downloading the source and compiling it.  Note that Java 1.6+ is required; a complete set of requirements can be found here.

The archive is only about 37MB (give or take), so the download completed relatively quickly over my bonded DSL connection.

Installation

After un-tarring the package, moving it into an appropriate directory (I'm a sucker for /etc), and starting the Neo4j server from the command line via bin/neo4j start (don't worry; there's a README.txt in the root of the installation directory with all the quickstart instructions in it), I was ready to rock!

(I should note that I did get a couple warnings, shown below, but it hasn't seemed to have affected anything just yet, likely given how small my current graph is.

WARNING: Detected a limit of 1024 for maximum open files, while a minimum value of 40000 is recommended.
WARNING: Problems with the operation of the server may occur. Please refer to the Neo4j manual regarding lifting this limitation.
)

Or so I thought.

Configuration

If there's one bone of contention I have with Neo4j, it's that finding the appropriate (and up-to-date) documentation for the config files takes a bit of digging (it's not impossible by any stretch, though).

As I quickly found out, trying to access the web admin console that comes with Neo4j (very handy, I must say) outside of localhost is a non-starter out of the box.

Did I pack it in for the day and go back to flipping through Steam for cheap games?  No!  I did some "research".

Here's the solution: In order to get Neo4j's web admin to work from somewhere outside of localhost, edit the neo4j-server.properties file in the install directory's conf directory (go figure).

Commented out towards the top of the file is the property org.neo4j.server.webserver.address.  Uncomment it and change it to the IP you want to bind the server to (the property does note that there are security concerns to consider, so you may want to consult the Neo4j documentation before doing this).

You can also change other settings in this file, e.g. getting it to work over HTTPS, changing the default ports for each, etc.

(Note: The web admin defaults to running over HTTP on port 7474 and over HTTPS on port 7473.)
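For reference, the relevant chunk of my conf/neo4j-server.properties ended up looking roughly like this (0.0.0.0 binds to all interfaces, with the attendant security caveats; substitute a specific IP if you prefer):

```properties
# let the web admin answer on more than just localhost
org.neo4j.server.webserver.address=0.0.0.0

# the defaults, should you wish to change them
#org.neo4j.server.webserver.port=7474
#org.neo4j.server.webserver.https.port=7473
```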

So, after making the change to the appropriate IP and restarting the Neo4j server, I tried pointing my browser back at the Neo4j web admin.

Success!


Poking Around

Without going into too much detail (I'll likely do that in subsequent posts), the Neo4j web admin has 5 distinct sections to help manage your installation:
  1. Dashboard
  2. Data browser
  3. Console
  4. Server info
  5. Index manager
Each one is fairly self-explanatory.

The dashboard provides at-a-glance information about your server over a specified timeline, such as the total number of nodes, properties, relationships and relationship types.

The data browser allows you to perform basic CRUD operations via a GUI.  You can also perform look-ups (consult the Help icon immediately to the right of the search button for more details on exactly what you can search for).  In other words, you can create a graph right then and there.

You can also flip the view to a graphical representation of the current graph and manipulate it (via click-and-drag) directly.  This is perhaps the coolest part of the web admin console (hey, we all like cool features!).


That's some serious badassery right there.

Next we have the console.  This is a great way to get familiar with the languages used to query Neo4j, including HTTP (i.e. accessing the REST calls), Gremlin (a Groovy-based querying language becoming common across multiple graph databases; it seems to be mainly for those coming from a math/graph background), and Cypher (Neo4j's own querying language; it seems to be mainly for those coming more from an SQL background).  

A quick note: At the time of this writing, Cypher only allows for read-only queries, whereas Gremlin allows for both reading and writing.

Next up, server info.  This is just a way to view (read: read-only) the server's configuration information.  No biggie.

Finally, we have the index manager.  Now, this is something I'm sure I'll be getting into a lot more as time goes on.  It's worth noting that Neo4j is built using the Lucene project for indexing, so this is very promising (especially for those familiar with Lucene and/or Solr). This makes a great deal of sense given the concept of properties for each node (full text search is going to be very important).

Regardless, you can create and manage indices for both nodes and relationships here.

So there we have it: My first venture into graph databases.  I'll admit I picked Neo4j first based on my initial research into graph databases.  It does seem to be the most popular graph database at the moment, so I look forward to seeing what it can do.

In subsequent posts, I'll be monkeying around with the querying languages, creating and modifying graphs, messing around with indices, and all kinds of other good stuff.

I hope some of you out there found this somewhat useful/informative/cool.  Well, I know I found it cool, but then I always was kind of odd...

Until next time, when we'll go Graph to the Future!  (Sorry, couldn't go an entire post without making at least one movie-based pun.)