Bug CORE-251
1 vote

URL-encoded characters in URIs

Created by mmmmmrob on 7/12/2012 2:55 PM
Last Updated by Rob Vesse on 7/13/2012 12:33 AM

 Description

Adding the following test case to RdfXmlTests reproduces the issue:

 

        [TestMethod]
        public void ParsingRdfXmlWithUrlEscapedNodes()
        {
            NTriplesFormatter formatter = new NTriplesFormatter();
            RdfXmlParser domParser = new RdfXmlParser(RdfXmlParserMode.DOM);
            Graph g = new Graph();
            domParser.Load(g, "urlencodes-in-rdfxml.rdf");

            // The percent-encoded URI and its unencoded counterpart should be distinct nodes
            IUriNode encodedNode = g.GetUriNode(new Uri("http://example.com/some%40encoded%2FUri"));
            Assert.IsNotNull(encodedNode, "The encoded node should be returned by its encoded URI");

            IUriNode unencodedNode = g.GetUriNode(new Uri("http://example.com/some@encoded/Uri"));
            Assert.IsNotNull(unencodedNode, "The unencoded node should be returned by its unencoded URI");

            // Each node carries a different literal for the same property in the file
            IUriNode encoded = g.CreateUriNode(new Uri("http://example.org/schema/encoded"));
            Assert.IsTrue(g.ContainsTriple(new Triple(encodedNode, encoded, g.CreateLiteralNode("true"))), "The encoded node should have the property 'true' from the file");
            Assert.IsTrue(g.ContainsTriple(new Triple(unencodedNode, encoded, g.CreateLiteralNode("false"))), "The unencoded node should have the property 'false' from the file");
        }

    Rob Vesse (Friday, July 13, 2012 12:31 AM) #

Tracked this down to a bug in Tools.ResolveUri(), which was using the .Net Uri machinery to do URI resolution but, because of how it returned values, could lose some encoding information. With this fixed, the provided test case passes successfully.

 

Fix is in revision 2260
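
As a rough illustration of how encoding information can be lost when a System.Uri is turned back into a string, the sketch below prints the three string views that Uri exposes. The class and method names are hypothetical and not part of dotNetRDF. This report does not say which view Tools.ResolveUri() returned before the fix, so this only shows the kind of difference involved. The exact output of AbsoluteUri and ToString() can vary between .Net versions, so no expected output is given.

```csharp
using System;

// Hypothetical demo class for illustration; not part of dotNetRDF.
public static class UriStringFormsSketch
{
    // Returns the three string views of a System.Uri built from the input.
    public static string[] StringForms(string input)
    {
        Uri u = new Uri(input);
        return new string[]
        {
            u.OriginalString, // exactly the string passed to the constructor
            u.AbsoluteUri,    // canonical absolute form that retains escaping
            u.ToString()      // documented as an unescaped canonical form
        };
    }

    public static void Main()
    {
        // URI taken from the test case in the description
        foreach (string form in StringForms("http://example.com/some%40encoded%2FUri"))
        {
            Console.WriteLine(form);
        }
    }
}
```

If the printed forms differ on a given runtime, code that returns one form and later compares against another can end up treating the encoded and unencoded URIs as the same node.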

    Rob Vesse (Friday, July 13, 2012 12:18 AM) #

So I figured out that this is partly down to .Net URI encoding behaviour, though not quite in the way either of us thought. The problem is that @ is a reserved character which may optionally be percent encoded, so when it is percent encoded .Net leaves it percent encoded in its internal representation. I believe .Net leaves it percent encoded because of the following section from RFC 3986 (added emphasis is mine):

 

 

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   The purpose of reserved characters is to provide a set of delimiting
   characters that are distinguishable from other data within a URI.
   URIs that differ in the replacement of a reserved character with its
   corresponding percent-encoded octet are not equivalent.  Percent-
   encoding a reserved character, or decoding a percent-encoded octet
   that corresponds to a reserved character, will change how the URI is
   interpreted by most applications.  Thus, characters in the reserved
   set are protected from normalization and are therefore safe to be
   used by scheme-specific and producer-specific algorithms for
   delimiting data subcomponents within a URI.

 

Thus I think .Net purposefully chooses not to decode the %40 when creating the URI in code, because that can change the meaning of the URI. Changing the %40 for another, more innocuous percent encoding like %20 (and thus replacing @ with a space in the corresponding unencoded form) does show the expected behaviour.
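
The normalization rule quoted from RFC 3986 can be sketched as follows. This is a minimal illustration, assuming a strict reading in which only percent-encoded unreserved characters (letters, digits, "-", ".", "_", "~") are decoded. It is not how .Net implements Uri, and the class and method names are hypothetical. Under this strict rule %20 would also stay encoded, so .Net's handling of %20 described above is evidently more permissive than the sketch.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of .Net or dotNetRDF.
public static class PercentEncodingSketch
{
    // RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static bool IsUnreserved(char c)
    {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Decodes a %XX triplet only when its octet is an unreserved character.
    // Every other triplet (for example %40 or %2F) is copied through unchanged.
    public static string DecodeUnreservedOnly(string uri)
    {
        StringBuilder sb = new StringBuilder(uri.Length);
        int i = 0;
        while (i < uri.Length)
        {
            if (uri[i] == '%' && i + 2 < uri.Length
                && Uri.IsHexDigit(uri[i + 1]) && Uri.IsHexDigit(uri[i + 2]))
            {
                char decoded = (char)(Uri.FromHex(uri[i + 1]) * 16 + Uri.FromHex(uri[i + 2]));
                if (IsUnreserved(decoded))
                {
                    sb.Append(decoded);   // safe to normalize
                }
                else
                {
                    sb.Append(uri, i, 3); // protected from normalization
                }
                i += 3;
            }
            else
            {
                sb.Append(uri[i]);
                i++;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // prints http://example.com/some%40encoded%2FUri (reserved characters stay encoded)
        Console.WriteLine(DecodeUnreservedOnly("http://example.com/some%40encoded%2FUri"));
        // prints http://example.com/~user (unreserved character is decoded)
        Console.WriteLine(DecodeUnreservedOnly("http://example.com/%7Euser"));
    }
}
```

Under this rule http://example.com/some%40encoded%2FUri and http://example.com/some@encoded/Uri remain distinct, which is what the test case in the description expects.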

 

The catch here is that when the URI is created from the RDF/XML it does seem to get decoded, so I'm looking into why that is, because that might be an actual bug.

    Rob Vesse (Thursday, July 12, 2012 6:07 PM) #

Thanks for the report, I will look into this and report back. At first glance this may be due to .Net's URI handling behaviour, in which case it may not be fixable, but I haven't debugged this properly yet so I can't be sure.