So I recently tweeted about the new benchmarking results I was generating for dotNetRDF, which showed a massive improvement in performance and linear scaling. Gregory Williams asked whether I was going to do a write-up, so here it is.
Firstly some background on how the SPARQL engine in dotNetRDF works. For a start the engine is entirely in-memory and it is relatively basic in its design. The data structures used are essentially a collection of graphs, where each graph is a collection of triples; each graph maintains its own indexes over those triples and allows us to look up triples using those indexes.
So when I say the data structures are simple I mean it: there are no Node IDs, no packing of values into Nodes beyond the simple string/URI values that represent the RDF terms - in fact, very little cleverness at all. This of course poses some problems in building a scalable engine, because when we do joins we are required to do full term equality checks to determine whether two solutions are compatible - there are no quick integer comparisons for us.
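To make that concrete, here's a minimal sketch of what a compatibility check has to do when all you have are term values. The `Solution` class below is a simplified stand-in for illustration, not the actual dotNetRDF API:

```csharp
using System.Collections.Generic;

// Simplified stand-in for an intermediate solution: a map from variable names to
// RDF term values (in the real engine these are full Node objects, not plain strings).
class Solution
{
    public Dictionary<string, string> Bindings = new Dictionary<string, string>();

    // Two solutions are compatible if every variable bound in both maps to an equal term.
    // Without Node IDs each check is a full term (string/URI) equality comparison.
    public bool IsCompatibleWith(Solution other)
    {
        foreach (var pair in this.Bindings)
        {
            string otherValue;
            if (other.Bindings.TryGetValue(pair.Key, out otherValue) && pair.Value != otherValue)
            {
                return false;
            }
        }
        return true;
    }

    // Merge the bindings of two compatible solutions into a new solution.
    public static Solution Merge(Solution left, Solution right)
    {
        var merged = new Solution();
        foreach (var pair in left.Bindings) merged.Bindings[pair.Key] = pair.Value;
        foreach (var pair in right.Bindings) merged.Bindings[pair.Key] = pair.Value;
        return merged;
    }
}
```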
This has meant that historically our join algorithm has been naive and slow: it simply iterated over the two sides of the join in a doubly nested loop and checked each possible pairing to see if it was compatible. Clearly this is an O(n²) algorithm, so it is slow to start with and scales quadratically. While some effort has been expended in the past on various algorithms, ultimately it has never been the highest of our priorities because there have been plenty of other more pressing things to implement in the library.
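For illustration, the old approach boils down to something like the following sketch (built on the simplified `Solution` type above, not the actual library code):

```csharp
using System.Collections.Generic;

static class NaiveJoinSketch
{
    // Naive join: test every LHS solution against every RHS solution.
    // That is n * m compatibility checks, each doing full term equality comparisons.
    public static List<Solution> NaiveJoin(List<Solution> lhs, List<Solution> rhs)
    {
        var results = new List<Solution>();
        foreach (var left in lhs)
        {
            foreach (var right in rhs)
            {
                if (left.IsCompatibleWith(right))
                {
                    results.Add(Solution.Merge(left, right));
                }
            }
        }
        return results;
    }
}
```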
Ultimately sometimes you have to admit defeat and say that clearly this code sucks, so how do we make it better? This is where Sprinkle SPARQL comes in. Now this algorithm was actually designed to run on a massively multi-threaded architecture where you have vast amounts of globally shared memory which you can happily use - in essence it allows you to trade space for time. If you go and read the paper, the obvious problem you hit is that if you want to apply this algorithm to a single-threaded engine with limited memory, you don't have the memory to go around allocating the massive arrays the algorithm uses.
Therefore what we actually do is take the basic principle behind the algorithm and implement it in a way that makes sense for a single-threaded architecture. The algorithm we came up with can be summarized as follows (with a sketch after the list):
- Determine what the joinable variables are (variables shared between LHS and RHS)
- Make a pass over the results from the LHS of the join and for each joinable variable build a Hash Table mapping each unique value for that variable to all the intermediate solutions that contain it
- Make a pass over the results from the RHS of the join using the joinable variables to look up potential matches.
- The actual set of potential matches is the intersection of the possible matches for each joinable variable; since each intermediate solution is identified by a unique integer, this is a fast intersection of lists of integers.
- For each potential match we do the full equality check to see if the solutions are actually compatible (since we have to allow for hash code collisions)
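Putting those steps together, a single-threaded version looks roughly like the sketch below. It reuses the simplified `Solution` type from earlier and is my paraphrase of the approach rather than the exact dotNetRDF code; names like `joinVars` are purely illustrative:

```csharp
using System.Collections.Generic;
using System.Linq;

static class HashJoinSketch
{
    public static List<Solution> HashJoin(List<Solution> lhs, List<Solution> rhs, IList<string> joinVars)
    {
        // Step 1: for each joinable variable, build a hash table mapping each unique term
        // value to the indices of the LHS solutions that bind the variable to that value.
        var indexes = new Dictionary<string, Dictionary<string, List<int>>>();
        foreach (var v in joinVars) indexes[v] = new Dictionary<string, List<int>>();

        for (int i = 0; i < lhs.Count; i++)
        {
            foreach (var v in joinVars)
            {
                string value;
                if (!lhs[i].Bindings.TryGetValue(v, out value)) continue;

                List<int> bucket;
                if (!indexes[v].TryGetValue(value, out bucket))
                {
                    bucket = new List<int>();
                    indexes[v][value] = bucket;
                }
                bucket.Add(i);
            }
        }

        // Step 2: probe with each RHS solution; the candidate matches are the intersection
        // of the per-variable buckets, which is just an intersection of integer lists.
        // (This sketch assumes there is at least one joinable variable.)
        var results = new List<Solution>();
        foreach (var right in rhs)
        {
            IEnumerable<int> candidates = null;
            foreach (var v in joinVars)
            {
                string value;
                List<int> bucket;
                if (!right.Bindings.TryGetValue(v, out value) || !indexes[v].TryGetValue(value, out bucket))
                {
                    candidates = Enumerable.Empty<int>();
                    break;
                }
                candidates = (candidates == null) ? bucket : candidates.Intersect(bucket);
            }

            // Step 3: full compatibility check on the surviving candidates; the real engine
            // has to allow for hash code collisions here (this sketch keys on the term
            // strings themselves, so for it the check is mostly a safety net).
            foreach (var i in candidates ?? Enumerable.Empty<int>())
            {
                if (lhs[i].IsCompatibleWith(right))
                {
                    results.Add(Solution.Merge(lhs[i], right));
                }
            }
        }
        return results;
    }
}
```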
I suspect that there are quite a few other stores which implement something very similar and having seen the results of this I wish I'd implemented it a long time ago!
The advantage of this algorithm for us is that we now have a much smaller number of full term equality comparisons to perform, because most of the join logic becomes hash table lookups and integer intersections, which are far faster. Also this algorithm is O(2n), i.e. O(n), which means it is linear with respect to the size of the intermediate results rather than quadratic.
When you compare the two algorithms graphically, as in the following graph comparing the performance of the current public release against the upcoming 0.6.0 release which incorporates this brand new algorithm, you can see just how much of a difference in performance this makes:
So the above shows the results of the Berlin SPARQL Benchmark v3.1 run on a variety of dataset sizes from 5,000 triples to 1 million triples, which while relatively small by some standards is about as far as the engine has previously been able to scale. It plots total runtime in seconds for 50 runs of the algorithm against the number of triples in the dataset.
The graph is pretty damning: the old naive algorithm scales terribly, while the new algorithm scales linearly, making for one very pleasing line on the graph from my point of view. This also suggests that the new algorithm will allow us to scale the engine to much larger numbers of triples than we previously could, since we'll now be limited more by how many triples we can load into memory than by the speed of our joins.
My takeaway from this is that sometimes you really just have to throw out your code and re-think it, because when you do you can sometimes make fantastic improvements just like this!