So I tweeted the other week about the fact that I'd been playing with using MongoDB as an RDF store, and since there's been quite a lot of interest in that, this blog post aims to cover where I've got to so far. If you have any comments, suggestions or ideas on this, either leave a comment on this post or tweet me @RobVesse.
MongoDB Fundamentals
So a quick introduction to MongoDB: it's one of the many NoSQL databases out there and is a document-oriented database where documents are chunks of JSON. Actual storage in MongoDB uses BSON (a binary serialisation of JSON), and Mongo provides powerful B-Tree based indexing features, which are very useful for querying the data you put into it.
Initial Approach
The aim of my experimenting was to evaluate whether MongoDB was viable for use as an RDF store. In terms of viability I am interested both in the ability to store and retrieve data in reasonable amounts of time and in the ability to quickly look up subsets of the data for querying purposes. From everything I'd read about MongoDB this appeared to be a good fit, so I went ahead with trying it out.
As MongoDB is based around JSON you need some mapping from the data you want to store into a JSON schema. Since RDF already has the RDF/JSON serialization, my initial effort used that. For example, the following RDF graph given as NTriples would translate to the subsequent JSON:
<http://example.org/subject> <http://example.org/predicate> <http://example.org/object> .
{
  "http://example.org/subject" :
  {
    "http://example.org/predicate" :
    [
      { "value" : "http://example.org/object", "type" : "uri" }
    ]
  }
}
While this seems sensible, in practice it proved unsuitable. This is because MongoDB indexes are on the values of properties in the JSON, not on the keys, so in the serialization above I can't actually query for Triples with a given Subject and Predicate because those are stored as property names and not as values. In addition, since the keys are URIs they contain `.` characters, which are not permissible in MongoDB key names since `.` is used as a path operator in queries (think CSS). So this approach was scrapped because it didn't meet my second criterion for viability: the ability to query the store efficiently.
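To make the problem concrete, here is a minimal Python sketch (not the dotNetRDF code, which is C#) that builds the RDF/JSON structure for the triple above and then lists every key containing a `.` character. The helper names `to_rdf_json` and `keys_with_dots` are made up for illustration, and the sketch only handles URI objects.

```python
from collections import defaultdict


def to_rdf_json(triples):
    """Group (subject, predicate, object) URI triples into RDF/JSON form."""
    doc = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        # Subject and predicate become *keys*; only the object is a value.
        doc[s][p].append({"value": o, "type": "uri"})
    # Convert the nested defaultdicts to plain dicts for a clean result.
    return {s: dict(preds) for s, preds in doc.items()}


def keys_with_dots(doc):
    """Return every key, at any depth, that contains a '.' character."""
    found = []
    if isinstance(doc, dict):
        for key, value in doc.items():
            if "." in key:
                found.append(key)
            found.extend(keys_with_dots(value))
    elif isinstance(doc, list):
        for item in doc:
            found.extend(keys_with_dots(item))
    return found


triples = [(
    "http://example.org/subject",
    "http://example.org/predicate",
    "http://example.org/object",
)]
rdf_json = to_rdf_json(triples)
bad_keys = keys_with_dots(rdf_json)
```

Here `bad_keys` ends up holding both the subject and the predicate URI, which are exactly the two things I wanted to query on but which only exist as property names.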
Approach #2 - Graph Centric
Understanding the problems of the previous serialization I then designed the following serialization:
{
  name : "some-name",
  graph : [
    {
      "subject" : "<http://example.org/subject>" ,
      "predicate" : "<http://example.org/predicate>" ,
      "object" : "<http://example.org/object>"
    }
  ]
}
In this serialization a Graph is serialized as a document that has a name (typically a hash of the Graph URI, used for retrieval of a specific Graph) and a graph property. The graph property is an array of JSON objects where each object is a single Triple with subject, predicate and object properties. Using an array of objects allows me to leverage a feature of MongoDB called multikey indexing, which allows me to index on properties of objects within an array.
Note that the values of these properties are now the NTriples serialization of the elements of a Triple. This is done so that we have a canonical serialization of values, so that when we later want to do a query we don't have any issues with false positives/negatives due to differing serializations. This canonicalization does have some slight performance drawbacks, in that we have to properly unescape values when retrieving them, since NTriples escapes various characters and these escapes get re-escaped when translated into JSON by MongoDB (or at least by the Mongo C# Driver). In testing I still got sufficiently good read/write performance to justify this approach, and it is important for querying.
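As a rough illustration of the canonicalization and the re-escaping it causes, the Python sketch below serializes terms to NTriples form and builds a graph-centric document from them. It is an assumption-laden stand-in: the helper names are hypothetical, `nt_escape` covers only a small subset of the NTriples escaping rules, and Python's `json` module stands in for whatever encoding the driver actually performs.

```python
import json


def nt_escape(text):
    """Escape a few characters NTriples requires in literals (subset only)."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t"))


def nt_uri(uri):
    """Serialize a URI in NTriples form, e.g. <http://example.org/x>."""
    return "<" + uri + ">"


def nt_literal(text):
    """Serialize a plain literal in NTriples form, e.g. "some text"."""
    return '"' + nt_escape(text) + '"'


def graph_document(name, triples):
    """Build a graph-centric document from already-serialized terms."""
    return {
        "name": name,
        "graph": [
            {"subject": s, "predicate": p, "object": o}
            for s, p, o in triples
        ],
    }


doc = graph_document("some-name", [(
    nt_uri("http://example.org/subject"),
    nt_uri("http://example.org/predicate"),
    nt_literal('say "hi"'),
)])

# NTriples turns each inner quote into \" and JSON encoding then escapes
# both the backslash and the quote again.
stored_object = doc["graph"][0]["object"]
as_json = json.dumps(stored_object)
```

The stored object contains two backslashes after NTriples escaping, and its JSON encoding contains eight, so a reader has to undo both layers to get back the original literal.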
Unfortunately I found that this approach still has some limitations:
- MongoDB limits the maximum size of documents, so if I try to add a larger graph (10,000s of Triples) then it will fail.
- As all Triples for each Graph are in one MongoDB document, when I make a query to MongoDB I get back all the documents that have Triples matching that query, but I then have to manually filter the documents to get only the relevant Triples. For a common predicate like rdf:type this means you potentially have to scan the entire store just to find the relevant triples, which is obviously very inefficient.
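The second limitation can be sketched in a few lines of Python. This is an in-memory stand-in rather than a real MongoDB query: `matching_graphs` mimics a query on the predicate of the embedded triples (whole documents come back), and `filter_triples` is the manual pass needed afterwards. Both function names and the sample data are invented for illustration.

```python
def matching_graphs(store, predicate):
    """Mimic a query on the embedded predicate: whole documents come back."""
    return [
        doc for doc in store
        if any(t["predicate"] == predicate for t in doc["graph"])
    ]


def filter_triples(docs, predicate):
    """Client-side pass needed to keep only the triples that matched."""
    return [
        t for doc in docs for t in doc["graph"]
        if t["predicate"] == predicate
    ]


RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
store = [
    {"name": "g1", "graph": [
        {"subject": "<http://example.org/a>", "predicate": RDF_TYPE,
         "object": "<http://example.org/Thing>"},
        {"subject": "<http://example.org/a>",
         "predicate": "<http://example.org/label>", "object": '"A"'},
    ]},
    {"name": "g2", "graph": [
        {"subject": "<http://example.org/b>",
         "predicate": "<http://example.org/label>", "object": '"B"'},
    ]},
]

docs = matching_graphs(store, RDF_TYPE)
# Every triple in a matching document is transferred, relevant or not.
transferred = sum(len(doc["graph"]) for doc in docs)
relevant = filter_triples(docs, RDF_TYPE)
```

Even in this tiny store, two triples come back for one relevant result; with a predicate that appears in nearly every graph the ratio gets far worse.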
Approach #3 - Triple Centric
With these limitations in mind I then tried a third approach, which was to represent every Triple as a single document in MongoDB, so my serialization now looks like the following:
{
  name: "some-name" ,
  uri: "http://example.org/graph"
}
{
  subject : "<http://example.org/subject>" ,
  predicate : "<http://example.org/predicate>" ,
  object : "<http://example.org/object>" ,
  graphuri : "http://example.org/graph"
}
In this approach each Graph is now split into multiple documents. The first has the name and uri of the graph and simply represents the fact that a Graph exists in the Store. The additional documents are used to represent each individual triple. Note that I use a graphuri property rather than a uri property so that I can index over both properties: when I need to map between a Graph URI and a Hash I can do that easily, and when I need to get all Triples from a Graph that is also easy. Having separate properties and thus separate indexes eliminates the need to retrieve irrelevant documents when making queries.
This approach addresses both of the issues with the Graph Centric approach. Very large Graphs can be stored because they are split into many small documents. Also, when I wish to query for Triples with specific values I can retrieve only the relevant Triples, eliminating the need to filter the results: all the documents MongoDB returns are relevant, so I just have to decode the documents into Triples.
Note that there is still one limitation of this schema: if the value of one of the properties of a Triple is sufficiently large (typically longer literals used as objects), MongoDB will not index that value, so if you then want to look up Triples based on that value MongoDB will be forced to perform a full index scan.
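A minimal Python sketch of the triple-centric layout follows. It is illustrative only: `graph_name`, `to_documents` and `find` are hypothetical helpers, SHA-256 is an assumed choice of hash (the post does not say which hash is used), and `find` is an in-memory stand-in for an equality query against the store.

```python
import hashlib


def graph_name(graph_uri):
    """Derive a graph name by hashing its URI (SHA-256 is an assumption)."""
    return hashlib.sha256(graph_uri.encode("utf-8")).hexdigest()


def to_documents(graph_uri, triples):
    """Return one graph document followed by one document per triple."""
    docs = [{"name": graph_name(graph_uri), "uri": graph_uri}]
    for s, p, o in triples:
        docs.append({
            "subject": s, "predicate": p, "object": o,
            "graphuri": graph_uri,
        })
    return docs


def find(docs, **criteria):
    """In-memory stand-in for an equality query on the given properties."""
    return [
        d for d in docs
        if all(d.get(k) == v for k, v in criteria.items())
    ]


docs = to_documents("http://example.org/graph", [
    ("<http://example.org/subject>", "<http://example.org/predicate>",
     "<http://example.org/object>"),
    ("<http://example.org/subject>", "<http://example.org/other>",
     '"a literal"'),
])

# Each lookup returns only relevant documents, so no second filtering pass.
by_predicate = find(docs, predicate="<http://example.org/predicate>")
in_graph = find(docs, graphuri="http://example.org/graph")
graph_docs = find(docs, uri="http://example.org/graph")
```

Because the graph document uses `uri` and the triple documents use `graphuri`, looking up the graph itself and looking up its triples never return each other's documents.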
Assessing Performance
To assess the performance of this approach I used my standard method, which is to run the BSBM benchmark against a MongoDB backed SPARQL endpoint. First off is the time taken to load data into the store:
The above table shows parsing times (time taken to read the data into memory) and the load time (time taken to write the data to MongoDB). As you can see, load performance degrades quite quickly with increasing numbers of Triples. A quick calculation of load speed shows that speed starts at ~12,000 Triples/second for the smallest dataset, dropping down to ~1,200 Triples/second for the largest dataset I tested.
Once I had these datasets loaded I then set up a number of SPARQL endpoints and ran the BSBM against them achieving the following results:
Once again, as the dataset gets increasingly large performance drops off quite dramatically, as you can see for yourself. The most pressing question for me was how this performance compared to the performance if the data was just in-memory, which you can see in the following graph:
As you can see, the MongoDB performance is worse than the in-memory performance and the gap in performance increases as the dataset size increases. You'll notice that my in-memory results go to much larger datasets than MongoDB. This is because I decided that, at the dataset size I'd gone up to, performance had degraded sufficiently to show a clear difference between in-memory and MongoDB performance, so I didn't feel it necessary to run further tests. Now some of this performance gap can be attributed to the additional overheads of having to communicate with the server and decode the values, but on the whole it is still fair to say that MongoDB performance is worse.
Note: While the above comparison is between different versions of the Leviathan engine, the only actual difference in the engine is that 0.4.0 has been extended to allow querying datasets held outside of memory, like MongoDB, i.e. the optimisations are identical.
Conclusion
So did my experiments fail? I'd say no, since I successfully stored and queried RDF data in MongoDB (using full SPARQL no less!) and performance was not so bad as to make MongoDB non-viable as an RDF store. It is not going to replace dedicated RDF stores, but it certainly has potential as a small-scale, easy to deploy store.
Resources & Disclaimer
All the work presented here is experimental work developed as one of the many aspects of the dotNetRDF project, and there is no guarantee that it will be released or that this represents its final form. That said, as the dotNetRDF project is open source you are free to use, experiment with or modify the code if you wish to do so.
You can find the code used in the dotNetRDF SVN Repository under Branches/040/Trunk/Libraries Branches/Obsolete/Libraries. I communicated with MongoDB using the MongoDB C# Driver and the experimental Alexandria library (found in the SVN repository), which is an abstraction layer for storing RDF in a document-oriented manner.
19/10/2010 12:48:04 by Rob Vesse in English
Tags: BSBM, Mongo, MongoDB, RDF, SPARQL, Store, Triple Store