[xep-support] RE: Rép. : Re: [xep-support] Invalid UTF-8 byte

LUC AUDRAIN LAUDRAIN at hachette-livre.fr
Thu Jun 30 08:46:43 PDT 2005


Hello Jacques,
 
Thank you for your kind answer. I have found the reason for this ill-formed UTF-8 character:
The faulty "A0" code was written directly from the text input file to the XML file, whereas the other "A0" codes were passed through the UTF-8 conversion routines. That is why this one was not correctly encoded in UTF-8 and the others were.
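
A minimal Python sketch of the fix described above: every field goes through the same conversion step instead of being copied to the XML output as raw bytes. The source encoding is not stated in this thread; ISO-8859-1 (Latin-1) is assumed here because it maps the byte A0 to NO-BREAK SPACE, and the function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Decode bytes from the assumed legacy encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# A lone A0 byte from the input file becomes the two-byte sequence C2 A0.
print(to_utf8(b"\xa0").hex())   # c2a0

# Plain ASCII text is unchanged by the conversion.
print(to_utf8(b"<Auteur>"))     # b'<Auteur>'
```

Any write path that bypasses such a function can emit a bare A0, which is the situation described above.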
 
Thank you very much for your help.
 
Best regards.
 
Luc
 
The question is: why does the first code, "C2 A0", work fine, and not the next one?
 
 

>>> Jacques.DESEYNE at swift.com 30/06/2005 14:52:24 >>>

Luc,
 
Bytes are not the same thing as characters! There are several conventions ("encodings") for representing characters as byte sequences. XML uses the Unicode character set (it contains a great many characters; see the code charts at http://www.unicode.org), and its default encoding is UTF-8, but other encodings can be used as well.
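
To make the distinction concrete, here is a small Python sketch that encodes one and the same character, NO-BREAK SPACE (U+00A0), under three encodings. The helper name `encodings_of` and the choice of encodings are illustrative assumptions, not something taken from the sample document.

```python
def encodings_of(char: str, names=("latin-1", "utf-8", "utf-16-be")) -> dict:
    """Return the hex byte sequence of one character under each encoding."""
    return {name: char.encode(name).hex() for name in names}


for name, hex_bytes in encodings_of("\u00a0").items():
    print(f"{name:10} {hex_bytes}")
# latin-1    a0
# utf-8      c2a0
# utf-16-be  00a0
```

The character is the same in all three cases; only the bytes on disk differ, and a parser told to expect UTF-8 accepts only the second form.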
 
In UTF-8, only characters up to 127 (0x7F) are represented by a single byte. The non-breaking space character (code point 0xA0) is represented by the byte sequence 'C2 A0'. Your sample document contains some of these, for instance within the <Auteur> tag for the <Ouvrage> whose <Nuart> contains "9610767":
 
...
000001b0   3c 2f 54 69 74 72 65 3e 3c 41 75 74 65 75 72 3e   </Titre><Auteur>
000001c0   c2 a0 3c 2f 41 75 74 65 75 72 3e 3c 50 72 69 78   ..</Auteur><Prix
...
 
Where you see the dodgy 'A0' byte (at file offset 0x00001140, if I'm not mistaken), you should have 'C2 A0', i.e. two bytes instead of one. You may need to check how these data are generated.
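
As a rough way to locate such bytes without a hex editor, the following Python sketch reports the offset of every byte at which UTF-8 decoding fails. The function name `find_invalid_utf8` and the sample bytes are hypothetical; for a real file one would pass the result of `open(path, "rb").read()`.

```python
def find_invalid_utf8(data: bytes) -> list:
    """Return the offsets at which decoding `data` as UTF-8 fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                      # the rest of the data is valid
        except UnicodeDecodeError as err:
            bad = pos + err.start      # err.start is relative to the slice
            offsets.append(bad)
            pos = bad + 1              # skip the bad byte and carry on
    return offsets


# One correct 'C2 A0' followed later by a bare 'A0'.
sample = b"<Auteur>\xc2\xa0</Auteur>\xa0<Prix>"
print(find_invalid_utf8(sample))       # [19]
```

The loop re-decodes the remainder after each failure, which is slow on files with many bad bytes but is enough to pinpoint one or two stray ones.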
 
Look for an explanation of UTF-8 (and other) encodings on the Web -- you will see that there is more to it than one might expect.
 
Best regards,
--
Jacques Deseyne
 

From: owner-xep-support at renderx.com [mailto:owner-xep-support at renderx.com] On Behalf Of LUC AUDRAIN
Sent: Thursday, June 30, 2005 11:58 AM
To: msulyaev at renderx.com; xep-support at renderx.com
Subject: Rép. : Re: [xep-support] Invalid UTF-8 byte



Hello Michael,
 
I think that it is a 0A that I have after the XML declaration, as I have at the end of each line of this file. The invalid UTF-8 byte is a 0xA0.
 
Looking a bit more closely, I have found this 'A0' byte: it is in the line beginning with "<Nuart>4776027" inside the element Run.
 
Now, I still don't understand why it is an invalid UTF-8 byte, because when I open this file in UltraEdit in Hex mode I see "00A0", and "00A0" is a valid Unicode character! I may filter it here, but in some cases I may need it, as it is the "NO-BREAK SPACE".
 
What's wrong?
 
 
 
 
 
Best regards
 
Luc AUDRAIN
__________________________________
DSI / Infocube
Informatique Éditoriale
HACHETTE LIVRE
43, quai de Grenelle
75015 PARIS
00 33 1 43 92 38 12
laudrain at hachette-livre.fr

>>> msulyaev at renderx.com 24/06/2005 17:28:42 >>>
Hello, Luc,

Your .xml file is invalid: it has a 0xA0 byte after the xml declaration 
and before anything else, e.g. like here (the last byte shown):

3C 3F 78 6D 6C 20 76 65 | 72 73 69 6F 6E 3D 22 31 <?xml version="1
2E 30 22 20 65 6E 63 6F | 64 69 6E 67 3D 22 55 54 .0" encoding="UT
46 2D 38 22 3F 3E 20 20 | 20 20 20 20 20 20 20 20 F-8"?>
20 20 20 20 20 20 20 20 | 20 20 20 20 20 20 20 20
A0 <

Use any HEX editor to fix.
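
If no hex editor is at hand, a dump like the one above can be produced with a few lines of Python. This is only a sketch: the function name `hexdump` and its layout are assumptions, and the sample bytes are rebuilt from the dump shown above rather than read from the actual file.

```python
def hexdump(data: bytes, start: int = 0, length: int = 64, width: int = 16) -> str:
    """Format `length` bytes of `data` from `start` as offset, hex and ASCII columns."""
    end = min(start + length, len(data))
    lines = []
    for off in range(start, end, width):
        chunk = data[off:min(off + width, end)]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as '.'
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}   {hex_part.ljust(width * 3 - 1)}   {text}")
    return "\n".join(lines)


# XML declaration, 26 spaces, then the stray A0 byte and the first '<'.
sample = b'<?xml version="1.0" encoding="UTF-8"?>' + b" " * 26 + b"\xa0<"
print(hexdump(sample, length=len(sample)))
```

In this reconstructed sample the stray byte lands at offset 0x40, at the start of the last dump line.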

-- 
Best regards,
Michael Sulyaev mailto:msulyaev at renderx.com 
RenderX.



LUC AUDRAIN wrote:
> Hello,
> 
> On some XML files, I get an error message on validation:
> 
> [error] Error reported by XML parser; SystemID: file:/J:/Traitement 
> BdC/Depot TXT/lg/OPERATION ARTEMIS CHASSE 23 AOUT 2005.xml; Line#: -1; 
> Column#: 949
> [error] javax.xml.transform.TransformerException: Error reported by XML 
> parser error: formatting failed: 
> javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: 
> invalid UTF-8 byte (check the XML declaration) (code: 0xa0)
> 
> I found information on the Renderx Web Site in this answer
> *From*: Mike Trotman <mike.trotman at datalucid.com>
> 
> *Date*: Mon May 02 2005 - 08:14:51 PDT
> and tried without success.
> 
> The workaround I found is to save the XML file again from any text or 
> XML editor (such as XMLSpy), and it works fine.
> 
> In order to find what's wrong in my source file, I'd like to know how to 
> use the line and column information in the error message: Line#: -1; 
> Column#: 949.
> 
> Best regards.
> 
> 
> 
> 
> 
> 
> 
> Luc AUDRAIN
> __________________________________
> DSI / Infocube
> Informatique Éditoriale
> HACHETTE LIVRE
> 43, quai de Grenelle
> 75015 PARIS
> 00 33 1 43 92 38 12
> laudrain at hachette-livre.fr
> 