Releases > dotNetRDF 0.6.0 Released

We're pleased to announce the 0.6.0 release of the dotNetRDF API - you can get this by going to Download dotNetRDF or obtain full source code by going to Download dotNetRDF Source.

This release is a major feature release of the library which includes a number of notable performance improvements.

Memory Usage reductions

dotNetRDF now automatically interns URIs as they are parsed from various inputs, you can also intern URIs you create by calling the Create() method of the UriFactory class. In testing we saw that this reduced the memory usage by around 40% which is a big deal when you start loading hundreds of thousands or millions of triples.

Advanced User Notice: If your application primarily processes RDF in a streaming fashion this may actually increase your memory usage, in this scenario please use the Options.InternUris property to disable this function.

SPARQL Performance Improvements

As highlighted in a previous blog post the improvements to our join algorithms in our SPARQL engine have led to massive performance improvements. Performance now scales much more linearly and is far faster, for example at a million triples on BSBM v3 we saw a 14x performance improvement.

We've also rewritten how expressions are evaluating in SPARQL so expression evaluation should now be far more efficient and performant.

SPARQL 1.1 Support

We've updated our SPARQL support to match the latest SPARQL specifications once again, we now once again are 100% SPARQL 1.1 compliant as seen on the offical W3C SPARQL 1.1 Evaluation Test Results report.

Turtle 1.1 Support

We now support what is commonly referred to as Turtle 1.1 which will be the official standardised W3C blessed form of Turtle. For maximum compatibility with other tools we default to Turtle 1.1 as input and Turtle 1.0 as output.

Bug Fixes

As always we've incorporated a number of bug fixes into this release, thanks to everyone who reported bugs, you should find your names in the acknowledgments file included in our releases. This release includes a critical fix for those wanting to use AllegroGraph 4.x.

20/02/2012 00:48:26 by Rob Vesse in English
128889 Views


Twitter about this

Tags: 0.6.0, Beta, dotNetRDF, Download, Interning, Release, Releases, SPARQL, Turtle, URI

General > Improving SPARQL Performance in dotNetRDF or why sometimes you have to throw stuff out

So I recently tweeted about the new benchmarking results I was generating for dotNetRDF which showed a massive improvement in performance and linear scaling. Gregory Williams asked about whether I was going to do a write up so here it is.

Firstly some background on how the SPARQL engine in dotNetRDF works. For a start the engine is entirely in-memory and it is relatively basic in its design. The data structures used are essentially a collection of graphs, where each graph is a collection of triples, each graph maintains its own indexes over those triples and allows us to look up triples using those indexes.

So when I say the data structures are simple I mean it, there are no Node IDs, no packing of values into Nodes beyond the simple string/URI values that represent the RDF terms in fact very little cleverness at all. This of course poses some problems in building a scalable engine because when we do joins we are required to do full term equality checks to determine if two solutions are compatible - there are no quick integer comparisons for us.

This has meant that historically our join algorithm has been naive and slow, it has simply had to iterate over the two sides of the join in a doubly nested loop and check each possible pairing to see if they are compatible. Clearly this is naive and this is an O(n2) algorithm hence slow to start with and exponential scaling. While some effort has been expended in the past on various algorithms ultimately it has never been the highest of our priorities because there have been plenty of other more pressing things to implement in the library.

Ultimately sometimes you have to admit defeat and say that clearly this code sucks so how do we make it better? This is where Sprinkle SPARQL comes in. Now this algorithm was actually designed to run on a massively multi-threaded architecture where you have vast amounts of globally shared memory which you can happily use - in essence it allows you to trade space for time. If you go and read the paper the obvious problem you hit is that if you want to apply this algorithm to a single threaded engine with limited memory you don't have the memory to go around allocating the massive arrays the algorithm uses.

Therefore what we actually do is to take the basic principle behind the algorithm and implement it in a way that makes sense to a single threaded architecture. The algorithm we came up with can be summarized as follows:

  1. Determine what the joinable variables are (variables shared between LHS and RHS)
  2. Make a pass over the results from the LHS of the join and for each joinable variable build a Hash Table mapping each unique value for that variable to all the intermediate solutions that contain it
  3. Make a pass over the results from the RHS of the join using the joinable variables to look up potential matches.
    • The actual set of potential matches is the intersection of the possible matches for each joinable variable, as intermediate solution is identified by a unique integer this is a fast intersection of lists of integers.
    • For each potential match we do the full equality check to see if the solutions are actually compatible (since we have to allow for hash code collisions)

I suspect that there are quite a few other stores which implement something very similar and having seen the results of this I wish I'd implemented it a long time ago!

The advantage of this algorithm for us is that it means we actually have a much smaller number of full term equality comparisons to perform because most of the join logic now becomes hash table lookups and integer intersections which are far faster. Also this algorithm is O(2n) which means it is linear with respect to the size of the intermediate results rather than exponential.

When you compare the two algorithms graphically as in the following graph comparing the performance of the current public release against the upcoming 0.6.0 release which incorporates this brand new algorithm you can see just how much of a difference in performance this makes:

BSBM v3 Results February 2012

So the above is the results of the Berlin SPARQL Benchmark v3.1 run on a variety of dataset from 5,000 triples to 1M triples which while relatively small by some standards is about as far as the engine has previously been able to scale. This shows total runtime in seconds for 50 runs of the algorithm versus the number of triples in the dataset.

The graph is pretty damning, the old naive algorithm scales terribly while the new algorithm scales linearly making for one very pleasing line on the graph from my point of view. This also suggests that the new algorithm will allow us to scale the engine to much larger numbers of triples than we previously could since we'll now be limited more by how many triples we can load into memory rather than the speed of our joins.

My takeaway from this is sometimes you really just have to throw out your code and re-think it because when you do you can sometimes make fantastic improvements just like this!

10/02/2012 07:28:20 by Rob Vesse in English
84958 Views


Twitter about this

Tags: Benchmarks, BSBM, Join, Performance, SPARQL

 
 

Powered By Visual Log from Visual Design Studios

Visual Log is Licensed Free for Any Use on this Website (User is Unregistered)