General > Experimenting with MongoDB as an RDF Store

I tweeted the other week that I'd be playing with MongoDB as an RDF store, and since there has been quite a lot of interest in that, this blog post covers where I've got to so far. If you have any comments, suggestions or ideas, either leave a comment on this post or tweet me @RobVesse.

MongoDB Fundamentals

A quick introduction to MongoDB: it's one of the many NoSQL databases out there and is a document-oriented database where documents are chunks of JSON. Actual storage in MongoDB uses BSON (a binary serialisation of JSON), and Mongo provides powerful B-Tree based indexing features which are very useful for querying the data you put into it.

Initial Approach

The aim of my experimenting was to evaluate whether MongoDB was viable as an RDF store. In terms of viability I am interested in two things: the ability to store and retrieve data in reasonable amounts of time, and the ability to quickly look up subsets of the data for querying purposes. From everything I'd read about MongoDB this appeared to be a good fit, so I went ahead with trying it out.

As MongoDB is based around JSON you need some mapping from the data you want to store into a JSON schema. Since RDF already has the RDF/JSON serialization, my initial effort used that. For example, the following RDF graph given as NTriples would translate to the subsequent JSON:

<http://example.org/subject> <http://example.org/predicate> <http://example.org/object> .

{ 
    "http://example.org/subject" : 
    {
        "http://example.org/predicate" : 
        [ 
          { "value" : "http://example.org/object", "type" : "uri" } 
        ] 
    }
}

While this seems sensible, in practice it proved unsuitable. In MongoDB indexes are on the values of properties in the JSON, not on the keys, so in the serialization above I can't actually query for Triples by Subject and Predicate because those terms are property names rather than values. In addition, since the keys are URIs they contain `.` characters, which are not permissible in MongoDB key names since `.` is used as a path operator in queries (think CSS). So this approach was scrapped because it didn't meet my second criterion for viability - the ability to query the store efficiently.
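To make the key problem concrete, here is a minimal sketch in Python that builds the RDF/JSON shape for the triple above and lists the keys that contain a `.` character. It uses plain dictionaries rather than a MongoDB driver, and the helper names `to_rdf_json` and `keys_with_dots` are made up for illustration.

```python
# Sketch only: plain dicts stand in for MongoDB documents.
# to_rdf_json and keys_with_dots are hypothetical helper names.

def to_rdf_json(subject, predicate, obj):
    """Nest a single URI triple as subject -> predicate -> [object]."""
    return {subject: {predicate: [{"value": obj, "type": "uri"}]}}


def keys_with_dots(document):
    """Collect every key, at any depth, that contains a '.' character."""
    found = []
    if isinstance(document, dict):
        for key, value in document.items():
            if "." in key:
                found.append(key)
            found.extend(keys_with_dots(value))
    elif isinstance(document, list):
        for item in document:
            found.extend(keys_with_dots(item))
    return found


doc = to_rdf_json(
    "http://example.org/subject",
    "http://example.org/predicate",
    "http://example.org/object",
)
problem_keys = keys_with_dots(doc)
print(problem_keys)
```

Both the subject and the predicate end up as keys, which is exactly where the terms I want to query by cannot be indexed.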

Approach #2 - Graph Centric

Understanding the problems of the previous serialization I then designed the following serialization:

{
  name : "some-name",
  graph : [
            { 
              "subject" : "<http://example.org/subject>" ,
              "predicate" : "<http://example.org/predicate>" ,
              "object" : "<http://example.org/object>"
            }
          ]
}

In this serialization a Graph is serialized as a document that has a name - typically a hash of the Graph URI, used for retrieval of a specific Graph - and a graph property. The graph property is an array of JSON objects where each object is a single Triple with subject, predicate and object properties. Using an array of objects lets me leverage a MongoDB feature called multikey indexing, which allows me to index on properties of objects within an array.

Note that the values of these properties are now the NTriples serialization of the elements of a Triple. This gives us a canonical serialization of values, so that when we later want to do a query we don't get false positives/negatives due to differing serializations. This canonicalization does have some slight performance drawbacks: we have to properly unescape values when retrieving them, since NTriples escapes various characters and these escapes get re-escaped when translated into JSON by MongoDB (or at least by the Mongo C# Driver). In testing I still got sufficiently good read/write performance to justify this approach, and it is important for querying.
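The double escaping described above can be sketched roughly as follows. This is a simplified Python illustration, not the dotNetRDF or Alexandria code: it handles only a handful of NTriples escapes, and `encode_uri`, `encode_literal` and `decode_literal` are hypothetical names.

```python
import json

# Sketch only: a small subset of NTriples escapes. Backslash comes first so
# that the backslashes added by later replacements are not escaped again.
_ESCAPES = [("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r"), ("\t", "\\t")]
_UNESCAPES = {"\\": "\\", '"': '"', "n": "\n", "r": "\r", "t": "\t"}


def encode_uri(uri):
    """Canonical form of a URI term: wrapped in angle brackets."""
    return "<" + uri + ">"


def encode_literal(text):
    """Canonical form of a plain literal: quoted, with special characters escaped."""
    for raw, escaped in _ESCAPES:
        text = text.replace(raw, escaped)
    return '"' + text + '"'


def decode_literal(encoded):
    """Reverse encode_literal by walking the escaped body one character at a time."""
    body = encoded[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            out.append(_UNESCAPES.get(body[i + 1], body[i + 1]))
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


original = 'He said "hi"\nbye\\'
encoded = encode_literal(original)   # NTriples-level escaping
as_json = json.dumps(encoded)        # JSON escapes the escapes a second time
round_trip = decode_literal(json.loads(as_json))
```

The JSON layer undoes its own escaping on the way back, but the NTriples layer still has to be unescaped by hand, which is the extra decoding cost mentioned above.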

Unfortunately I found that this approach still has some limitations:

  1. MongoDB limits the maximum size of documents, so if I try to add a larger Graph (10,000s of Triples) then it will fail.
  2. As all Triples for each Graph are in one MongoDB document, when I make a query I get back all the documents that have Triples matching that query, but I then have to manually filter the documents to get only the relevant Triples. For a common predicate like rdf:type this means you potentially have to scan the entire store just to find the relevant Triples, which is obviously very inefficient.
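The second limitation can be illustrated with a small Python sketch. Plain dictionaries stand in for MongoDB documents here and no driver is involved; `find_graphs` and `filter_triples` are hypothetical helpers that mimic, respectively, a query against the array field and the manual filtering step.

```python
# Sketch only: an in-memory list plays the role of the MongoDB collection.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

store = [
    {
        "name": "graph-a",
        "graph": [
            {"subject": "<http://example.org/a>", "predicate": RDF_TYPE,
             "object": "<http://example.org/Thing>"},
            {"subject": "<http://example.org/a>", "predicate": "<http://example.org/p>",
             "object": "<http://example.org/b>"},
            {"subject": "<http://example.org/b>", "predicate": "<http://example.org/p>",
             "object": "<http://example.org/c>"},
        ],
    },
    {
        "name": "graph-b",
        "graph": [
            {"subject": "<http://example.org/x>", "predicate": "<http://example.org/p>",
             "object": "<http://example.org/y>"},
        ],
    },
]


def find_graphs(docs, field, value):
    """Return whole graph documents that contain at least one matching triple."""
    return [doc for doc in docs if any(t[field] == value for t in doc["graph"])]


def filter_triples(docs, field, value):
    """Manual post-filter: keep only the triples that actually match."""
    return [t for doc in docs for t in doc["graph"] if t[field] == value]


matched_docs = find_graphs(store, "predicate", RDF_TYPE)
triples_returned = sum(len(doc["graph"]) for doc in matched_docs)
relevant = filter_triples(matched_docs, "predicate", RDF_TYPE)
print(triples_returned, len(relevant))
```

Even in this tiny store the query hands back three Triples to find the one that matters; with a predicate that appears in every Graph the whole store comes back.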

Approach #3 - Triple Centric

With these limitations in mind I then tried a third approach, which was to represent every Triple as a single document in MongoDB, so my serialization now looks like the following:

{
  name: "some-name" ,
  uri: "http://example.org/graph"
}

{
  subject : "<http://example.org/subject>" ,
  predicate : "<http://example.org/predicate>" ,
  object : "<http://example.org/object>" ,
  graphuri : "http://example.org/graph"
}

In this approach each Graph is now split into 1 or more documents. The first has the name and uri of the Graph and simply represents the fact that a Graph exists in the Store. The additional documents represent the individual Triples. Note that I use a graphuri property rather than a uri property on the Triple documents so that I can index the two properties separately: when I need to map between a Graph URI and a hash I can do that easily, and when I need to get all Triples from a Graph that is also easy. Having separate properties, and thus separate indexes, eliminates the need to retrieve irrelevant documents when making queries.

This approach addresses both issues of the Graph Centric approach. Very large Graphs can be stored because they are split into many small documents. And when I query for Triples with specific values I retrieve only the relevant Triples, so there is no need to filter the results: all the documents MongoDB returns are relevant and I just have to decode them into Triples.
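The triple-centric layout can be sketched in Python as follows. Again this uses plain dictionaries in place of MongoDB documents. The helper names are hypothetical, and SHA-256 is only an assumed choice of hash for the graph name, since the hash actually used is not specified here.

```python
import hashlib

# Sketch only: graph_to_documents and match are hypothetical helpers, and
# SHA-256 is an assumed stand-in for whatever hash names the graph.


def graph_to_documents(graph_uri, triples):
    """Emit one graph document followed by one document per triple."""
    name = hashlib.sha256(graph_uri.encode("utf-8")).hexdigest()
    docs = [{"name": name, "uri": graph_uri}]
    for subject, predicate, obj in triples:
        docs.append({
            "subject": subject,
            "predicate": predicate,
            "object": obj,
            "graphuri": graph_uri,
        })
    return docs


def match(docs, **pattern):
    """Mimic an equality query: keep documents whose fields equal the pattern."""
    return [d for d in docs if all(d.get(k) == v for k, v in pattern.items())]


RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
docs = graph_to_documents(
    "http://example.org/graph",
    [
        ("<http://example.org/a>", RDF_TYPE, "<http://example.org/Thing>"),
        ("<http://example.org/a>", "<http://example.org/p>", "<http://example.org/b>"),
    ],
)

typed = match(docs, predicate=RDF_TYPE)                       # only relevant triples
in_graph = match(docs, graphuri="http://example.org/graph")   # triple documents only
graph_doc = match(docs, uri="http://example.org/graph")       # the graph document only
```

Because the graph document uses `uri` and the Triple documents use `graphuri`, a lookup on one property never drags in documents of the other kind.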

Note that there is still one limitation of this schema. If the value of one of the properties of a Triple is sufficiently large (typically longer literals used as objects), MongoDB will not index that value, so if you then want to look up Triples based on that value MongoDB will be forced to perform a full index scan.

Assessing Performance

To assess the performance of this approach I used my standard method, which is to run the BSBM benchmark against a MongoDB backed SPARQL endpoint. First off is the time taken to load data into the store:

MongoDB Load Times

The above table shows parsing times (time taken to read the data into memory) and load times (time taken to write the data to MongoDB). As you can see, load performance degrades quite quickly with increasing numbers of Triples. A quick calculation of load speed shows that it starts at ~12,000 Triples/second for the smallest dataset and drops to ~1,200 Triples/second for the largest dataset I tested.

Once I had these datasets loaded I then set up a number of SPARQL endpoints and ran the BSBM against them achieving the following results:

MongoDB BSBM Times

Once again, as the dataset gets larger performance drops off quite dramatically, as you can see for yourself. The most pressing question for me was how this performance compared to the performance when the data was just in-memory, which you can see in the following graph:

MongoDB BSBM vs In-Memory BSBM

As you can see, the MongoDB performance is worse than the in-memory performance and the gap increases as the dataset size increases. You'll notice that my in-memory results go to much larger datasets than MongoDB. This is because I decided that, at the dataset size I'd gone up to, performance had degraded sufficiently to show a clear difference between in-memory and MongoDB performance, so I didn't feel it necessary to run further tests. Some of this performance gap can be attributed to the additional overheads of communicating with the server and decoding the values, but on the whole it is still fair to say that MongoDB performance is worse.

Note: While the above comparison is between different versions of the Leviathan engine the only actual difference in the engine is that 0.4.0 has been extended to allow querying out of memory datasets like MongoDB i.e. optimisations are identical.

Conclusion

So did my experiments fail? I'd say no, since I successfully stored and queried RDF data in MongoDB (using full SPARQL no less!) and performance was not so bad as to make MongoDB non-viable as an RDF store. It is not going to replace dedicated RDF stores, but it certainly has potential as a small-scale, easy to deploy store.

Resources & Disclaimer

All the work presented here is experimental work developed as one of the many aspects of the dotNetRDF project, and there is no guarantee that it will be released or that this represents its final form. As the dotNetRDF project is open source, you are free to use, experiment with or modify the code if you wish to do so.

You can find the code used in the dotNetRDF SVN Repository under Branches/040/Trunk/Libraries Branches/Obsolete/Libraries. I communicated with MongoDB using the MongoDB C# Driver and the experimental Alexandria library (found in the SVN repository), which is an abstraction layer for storing RDF in a document-oriented manner.

19/10/2010 12:48:04 by Rob Vesse in English

Tags: BSBM, Mongo, MongoDB, RDF, SPARQL, Store, Triple Store

General > Updated BSBM Benchmarks

Given the release of Version 0.3.1 of dotNetRDF and the fact that it contains a variety of new SPARQL optimisations I decided I should run the BSBM benchmarks again just to see how it compared. Results showed that minimum execution time is marginally increased but that overall execution time is significantly better.

BSBM Benchmark over all versions of dotNetRDF

images/bsbm_results_oct2010.jpg

BSBM Benchmark Version 0.2.2 vs 0.3.1

As you can see from the following comparison 0.3.1 again offers much better scalability and faster query execution times over 0.2.2.

images/bsbm_results_oct2010_comparison.jpg

11/10/2010 12:00:31 by Rob Vesse in English

Releases > dotNetRDF 0.3.1 Alpha Released

dotNetRDF 0.3.1 is now available for download. This is a minor release which primarily provides bug fixes and performance improvements. You can get it by going to Download dotNetRDF or get the source by going to Download dotNetRDF Source.

New Features

New RDF/XML Writer

Added a new RdfXmlWriter class which is a fast streaming writer and is the fastest RDF/XML writer we've produced to date, achieving approximately 40,000 triples/second.

Syntax Validators

Added a new set of classes implementing an ISyntaxValidator interface, which allow you to check whether strings are valid syntax.

Silverlight Build

Added an experimental Silverlight build - please be aware that this has yet to be fully tested and may be buggy.

SPARQL Views

Added the ability to create a SparqlView, which is a Graph that is dynamically generated based on a SPARQL query and automatically updated asynchronously when the underlying data source changes (for in-memory views).

Improvements

Bug Fixes

A variety of bug fixes including the following:
  • SqlTripleStore can be properly loaded from a Configuration file
  • SparqlXmlParser and SparqlJsonParser parse Blank Node IDs more intelligently
  • Fixes to TriG reading and writing around correct QName generation and parsing corner cases
  • Fixed Turtle, N3, TriG and SPARQL Parsers to support numeric literal corner cases and potentially ambiguous QNames/Plain Literals
  • UriLoader handles file IO errors in the cache gracefully
  • DESCRIBE query results are now correctly parsed when returned from Sesame
  • Using UpdateGraph to delete Triples from Sesame/AllegroGraph now only deletes the desired Triples not the entire Graph

SPARQL Optimisation

Added optimisations for ASK queries, queries that use LIMIT without an ORDER BY/GROUP BY/HAVING, and evaluation of MINUS clauses.

ASP.NET Improvements

Graph Handlers now output ETags and respond to requests with 304 Not Modified where appropriate

08/10/2010 13:14:26 by Rob Vesse in English