|
So one of my favourite aspects of the library is the SPARQL engine - I fully admit it's not the best out there but it's becoming increasingly powerful as I implement more and more of the SPARQL 1.1 specification and I continue to come up with ways to optimise the engine.
Note that all my effort is now focused on the Leviathan engine which was introduced in the 0.2.0 release and is far more powerful and maintainable than the previous Labyrinth engine which is due to be removed in future releases. This means that any improvements and new features I add are only in the Leviathan engine, if you aren't already using it you should be (it's set as the default engine from the 0.2.0 release onwards)
SPARQL Property Paths
So I've talked about SPARQL Property Paths previously and this is a feature which I just love - initially as an implementer I was worried about the cost (and realistically they can be quite complex and costly to evaluate) but they allow you to write really useful queries e.g.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {<http://example.org/someone> foaf:knows+ / foaf:name ?name }
The above example finds the names all the people that the person identified by the URI <http://example.org/someone> knows in one/more steps. So this literally finds the names of friends of friends!
But property paths also allows us to do things like implicit inference e.g.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/schema#>
SELECT * WHERE { ?class rdfs:subClassOf+ ex:SuperClass }
The above will find anything which is a sub-class of the given superclass regardless of the number of steps between the two classes.
Note that you aren't limited to having specific path starts or ends so you can do the following which finds all paths between a class and its superclasses:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE { ?class rdfs:subClassOf+ ?superclass }
Path Lengths
Path Lengths is an open issue for the working group and is unlikely to make it to standardisation but when I implemented support for full property paths I realised my implementations gives this as a by-product so I added a syntax extension like so:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {<http://example.org/someone> foaf:knows+ LENGTH ?l ?person }
The above example gives you all the people that know the person identified by the URI <http://example.org/someone> and tells you how many steps away from them they are. Note that paths do not permit duplicates so you'll always get the shortest path length possible.
The LENGTH ?var syntax must follow the end of a path and binds the length of that path to the given variable. This then allows you to filter/order by and do other calculations on the length of the path. Note that doing a FILTER(?var > 4) is not advised, it's far better to just explicitly state the cardinality constraints in your path e.g. foaf:knows{4,}
IN and NOT IN
SPARQL 1.1 provides for two new operators which allow you to filter a variable to a set of values e.g.
PREFIX : <http://example.org/schema>
SELECT * WHERE {?s ?p ?o . FILTER(?p IN (:propertyOne :propertyTwo)) }
Note that this is not necessarily the best way to write queries which are efficient, it simply gives you a shorthand as opposed to writing the following:
PREFIX : <http://example.org/schema>
SELECT * WHERE {?s ?p ?o . FILTER(?p = :propertyOne || ?p = :propertyTwo) }
Performance
Being an in-memory engine performance is always going to be an issue for Leviathan so we've worked hard to improve performance over time. Unfortunately while we've made some impressive improvements we've struggled with scalability up until now. I've just today been working on a new indexing optimisation technique which shows dramatic results in testing and leads to linear rather than exponential scaling. Below are some results from the Berlin SPARQL Benchmark comparing different versions of our SPARQL Engine:
We use the same datasets generated with the BSBM dataset generator using the scale factors on the X axis and perform 50 runs with 10 warmups for our testing. The value given on the X axis is the total runtime for the 50 runs.
As can be seen the improvement is massive, the improved optimisations allow us to run the benchmarks against datasets which were previously far too slow to even attempt. You'll probably notice that there is some data missing for some implementations - this is because we decided when we ran the tests originally that the performance difference was so minimal that it was unnecessary to run further tests. We use these tests mostly as a guide and a check to see if experimental optimisations have had a beneficial effect on performance so if they should little/no difference we tend to limit the number we run. 19/03/2010 17:26:56 by Rob Vesse in English
14777 Views
Tags: Benchmark, BSBM, dotNetRDF, Leviathan, Query, SPARQL
|