Subject: Re: [Boost-users] [boost.serialization] How does boost.serialization do with BOM
From: Tan, Tom (Shanghai) (TTan_at_[hidden])
Date: 2008-09-05 04:16:53


Thanks for the quick response.

The UTF-8 BOM is mostly a Windows convention. In my opinion, the BOM is
not really related to how you encode the text (and thus not to
utf8_codecvt_facet), but to how you mark which encoding is in use, so
that any text editor gets a hint about how to handle the text.

It's implemented by inserting a few bytes at the very beginning of the
file, ahead of the actual text. In the case of UTF-8, the bytes are
"EF BB BF", which is the UTF-8 encoding of U+FEFF; as far as I can tell
it should not show up as a visible character (I did not check, this is
just a guess).
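
To check the byte values rather than guess, here is a minimal sketch that
encodes a code point by hand using the standard three-byte UTF-8 layout.
The function name encode_utf8_3byte is made up for this example and is
not part of Boost. It only covers code points from U+0800 to U+FFFF,
which includes U+FEFF.

```cpp
#include <string>

// Hypothetical helper for illustration only (not a Boost function).
// Encodes a code point in the range U+0800..U+FFFF as three UTF-8 bytes:
//   1110xxxx 10xxxxxx 10xxxxxx
inline std::string encode_utf8_3byte(unsigned cp) {
    std::string out;
    out += static_cast<char>(0xE0 | (cp >> 12));          // top 4 bits
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // middle 6 bits
    out += static_cast<char>(0x80 | (cp & 0x3F));         // low 6 bits
    return out;
}
```

Feeding it 0xFEFF yields the bytes EF BB BF, so the BOM is an ordinary
encoded character that simply happens to sit at the start of the file.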

As this concerns text files in general and is not specific to XML
files, basic_text_iarchive might be a better place to address the issue.

I think that just detecting "EF BB BF" at the start of the input and
discarding those bytes if they are present would solve the issue.

But I am not sure which method needs to be overridden. Can you please
advise?

Thanks,
tom

>This is news to me.

>The wide character text/xml archives use UTF-8. They do this
>by creating a stream with the utf8_codecvt_facet. I used
>this facet, it worked great and I moved on. So you're way
>ahead of me on this.

>This would probably be easy to address in the xml_iarchive code
>or perhaps the xml_grammar - but, as I said, I don't know
>anything about it.

>Robert Ramey

Tan, Tom (Shanghai) wrote:
>> what is BOM?
>
>> Probably "Byte Order Mark", see
> http://en.wikipedia.org/wiki/Byte-order_mark
>
> Yes, That's what I meant.
>
> I was testing the demo_xml_load.cpp and demo_xml_save.cpp available
> in the boost.serialization examples.
> After simply opening the demo_save.xml produced by demo_xml_save.exe
> with XML Copy Editor (http://xml-copy-editor.sourceforge.net/) and
> saving it back, demo_xml_load.exe would crash. I compared the two
> files with WinMerge. It said they are identical.
>
> By studying the hex view, I later found that this is because the
> 3-byte UTF-8 BOM was inserted at the beginning of the file. It does
> not change the data, and in many cases it is ignored by text editors.
>
> I think that Boost.serialization should also handle this for all
> text files, including XML.
>
> Tom


