[xmlsec] Re: non us-ascii filenames in user locale

Fri Jun 25 05:27:27 PDT 2004

On Fri, Jun 25, 2004 at 12:46:26PM +0300, Roumen Petrov wrote:

> A.) From libxml "Encodings support" page 
> (http://www.xmlsoft.org/encoding.html) :
> ....
> for examples when adding a text node to a document, the content would 
> have to be provided in the document encoding
> ....

  That quote is unrelated and taken out of context. That section explains
precisely why all document-editing interfaces take UTF-8 strings only.

> B.) From rfc2396 (http://www.ietf.org/rfc/rfc2396.txt):
> ....
>  However, there is currently
>   no provision within the generic URI syntax to accomplish this
>   identification. An individual URI scheme may require a single
>   charset, define a default charset, or provide a way to indicate the
>   charset used.
> 
>   It is expected that a systematic treatment of character encoding
>   within URI will be developed as a future modification of this
>   specification."
> ....

   IRIs are still a work in progress, and the XML specs are based on
URIs, not IRIs, with the exception of XInclude. In the absence of clear
contextual information about the charset, you cannot expect one to be
guessed from the user locale. The fact that URLs are context-free and can
be copied around without changing their meaning is one of the basic
foundations of the web architecture.
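
  To make the locale problem concrete, here is a small Python sketch. It is
only an illustration: the file name is made up, and Latin-1 stands in for
whatever charset a user locale might happen to use. The same name escapes to
different octets depending on the charset assumed, so a receiver cannot
recover it without out-of-band information.

```python
from urllib.parse import quote, unquote

# Hypothetical file name with a non-ASCII character (U+00E9).
name = "r\u00e9sum\u00e9.xml"

# Same characters, different octets depending on the charset assumed.
as_latin1 = quote(name, encoding="latin-1")
as_utf8 = quote(name, encoding="utf-8")

print(as_latin1)  # r%E9sum%E9.xml
print(as_utf8)    # r%C3%A9sum%C3%A9.xml

# A receiver that assumes a different charset does not get the name back.
wrong = unquote(as_latin1, encoding="utf-8", errors="replace")
print(wrong == name)  # False
```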

> C.) From "XML-Signature Syntax and Processing " 
> (http://www.w3.org/TR/xmldsig-core/)
> ....
> 4.3.3.1 The URI Attribute ..."
> The URI attribute identifies a data object using a URI-Reference, as 
> specified by RFC2396 [URI]. The set of allowed characters for URI 
> attributes is the same as for XML, namely [Unicode]. However, some 
> Unicode characters are disallowed from URI references including all 
> non-ASCII characters and the excluded characters listed in RFC2396 [URI, 
> section 2.4]. However, the number sign (#), percent sign (%), and square 
> bracket characters re-allowed in RFC 2732 [URI-Literal] are permitted. 
> Disallowed characters must be escaped as follows:
> 
> Each disallowed character is converted to [UTF-8] as one or more octets.
> Any octets corresponding to a disallowed character are escaped with the 
> URI escaping mechanism (that is, converted to %HH, where HH is the 
> hexadecimal notation of the octet value).
> The original character is replaced by the resulting character sequence.
> ....
> 
> 
> 
> From A. I expect in Reference node URI to be in document encoding.

  Wrong

> From B. I see that we are free to use in URI any charset.

  Wrong; see for example the escaping section of the XPointer spec:
    http://www.w3.org/TR/xptr-framework/#escaping

> C.  define that we should use UTF-8 encoding.

  URI-escaped, UTF-8 encoded strings are the current best practice.
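
  As a rough sketch of the escaping steps quoted in C. (not code from libxml
or xmlsec): each disallowed character is converted to UTF-8 octets and each
octet is written as %HH. The function name and the sample file name are
hypothetical, and the set of excluded ASCII characters is my reading of
RFC 2396 section 2.4.3 minus the '#', '%', '[' and ']' that the quoted text
permits.

```python
# ASCII characters excluded by RFC 2396 section 2.4.3, minus '#', '%',
# '[' and ']' (assumption: this set reflects the text quoted in C.).
EXCLUDED_ASCII = set(' <>"{}|\\^`')


def escape_uri_attribute(text: str) -> str:
    """Percent-escape disallowed characters as UTF-8 octets (%HH)."""
    out = []
    for ch in text:
        code = ord(ch)
        # Control characters, DEL, all non-ASCII, and the excluded set.
        disallowed = code < 0x20 or code > 0x7E or ch in EXCLUDED_ASCII
        if disallowed:
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical Cyrillic file name (U+0444 U+0430 U+0439 U+043B) + fragment.
print(escape_uri_attribute("\u0444\u0430\u0439\u043b 1.xml#id"))
# %D1%84%D0%B0%D0%B9%D0%BB%201.xml#id
```

The result is pure ASCII, so it means the same thing whatever the document
encoding or the locale of whoever reads it.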

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard at redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/