Subject: Re: [boost] GSoC Proposal Preparation For Encoding Awared String
From: Soares Chen (crf_at_[hidden])
Date: 2011-03-16 07:30:00

Next message: Stewart, Robert: "Re: [boost] [Review] Boost.Type Traits Extension by Frederic Bron"
Previous message: Stewart, Robert: "Re: [boost] [chrono] Interoperability with ICL and common concepts"
In reply to: Soares Chen: "[boost] GSoC Proposal Preparation For Encoding Awared String"
Next in thread: Andrew Sutton: "Re: [boost] GSoC Proposal Preparation For Encoding Awared String"

Just to add one more proposal that I missed: I remember Mathias
Gaunard suggesting somewhere in the discussion that a range of two
iterators be used to represent code points, characters, and strings. I
don't remember the exact details, but I think that is at least how
Boost.Unicode represents Unicode strings, using boost::iterator_range.
I see similarities with the other proposals, but I'll leave that for
later discussion. Mathias Gaunard, please correct me if I have
misunderstood. Thanks.
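
To make the iterator-range idea concrete, here is a minimal sketch using only the standard library. It is not Boost.Unicode's actual interface: the names code_unit_range, make_range, next_code_point and code_points are hypothetical, and the decoder assumes well-formed UTF-8 (it only asserts that a sequence is not truncated).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical type, not Boost.Unicode's API: a "string" is nothing more
// than a [first, last) pair of iterators over UTF-8 code units.
template <typename It>
struct code_unit_range {
    It first;
    It last;
};

template <typename It>
code_unit_range<It> make_range(It first, It last) {
    return code_unit_range<It>{first, last};
}

// Decode one code point starting at `it` and advance `it` past it.
// Assumes well-formed UTF-8; only truncation is checked.
template <typename It>
char32_t next_code_point(It& it, It last) {
    assert(it != last);
    unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;                       // 1-byte sequence
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);               // payload bits of the lead byte
    while (extra-- > 0) {
        assert(it != last);                             // sequence must not be truncated
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    }
    return cp;
}

// View the same range as a sequence of code points.
template <typename It>
std::vector<char32_t> code_points(code_unit_range<It> r) {
    std::vector<char32_t> out;
    for (It it = r.first; it != r.last;) {
        out.push_back(next_code_point(it, r.last));
    }
    return out;
}
```

Calling code_points(make_range(s.begin(), s.end())) on a std::string s works unchanged for any container of bytes, which I think is the appeal of the range approach: the encoding logic lives in the algorithms rather than in a new string class.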

On Wed, Mar 16, 2011 at 1:20 AM, Soares Chen <crf_at_[hidden]> wrote:
> Hi all,
>
> I am an undergraduate student from the National University of
> Singapore and I am interested in taking part in this year's GSoC with
> the Boost community. Currently I am preparing my proposal to add
> support for encoding awareness in strings through a new or existing
> Boost project, but there are some questions I would like to ask to
> clarify the community's interest in such a project.
>
> My inspiration came from several lengthy discussions I found in
> the Boost mailing list archive, which took place recently, from mid
> January to mid February. [5][6][7] As a brief overview, according to
> my own understanding, the heated debates were mainly about different
> ways to ensure consistency between the encoding expected when library
> code accepts strings and the actual encoding of the strings passed by
> users. The problem arises because a small minority of developers use
> std::string with encodings other than UTF-8, and the implicit
> assumption of UTF-8 encoding for std::string brings inconsistency and
> causes numerous bugs that are outside the scope of the Boost library.
>
> From the discussion, I found several proposals that have been made to
> solve the inconsistencies of std::string encoding:
> 1. Create new classes that wrap around std::string, std::u16string,
> and std::u32string, one for each encoding, and ensure encoding
> correctness simply through C++ type safety features. The classes are
> tentatively called the utf*_t classes (a name many disliked) - Proposed
> by Chad Nelson, with a working prototype available. [1]
> 2. Continue to strongly enforce the assumption that all std::strings
> are UTF-8 encoded. Deprecate other encodings in std::string, or make
> them hard to use.
> 3. Reinvent std::string and introduce boost::string. The new string
> class is proposed to be immutable, and it delegates encoding
> awareness to templated view<> classes that wrap around boost::string.
> IMHO the view<> classes are similar to proposal (1). [7]
>
> I will try not to go into the details and pros & cons of each proposal
> here, to avoid turning this thread into yet another string discussion.
> The original discussions have 542 messages in total, spanning a whole
> month, and I failed to find any conclusion that everyone could agree
> on. What I noticed in the discussions is that there are several groups
> of people who have strong opinions on different ways to solve the
> problem and could not generally agree with each other. I also found
> that the discussions often drifted away and lost focus on the original
> problem, but every now and then someone would mention the problem
> again and propose a solution similar to the earlier proposals.
>
> Nevertheless, the discussions were extremely informative and insightful.
> I learned a lot just by reading through them. But since I intend to
> start a GSoC project based on this subject, I am unsure what I should
> actually do in this project, as I feel that there is no general
> agreement in the Boost community on how to solve this problem.
> Although I have some ideas and I personally lean towards proposal (1)
> by Chad Nelson, I think it will be best if my project fits the
> interests of the majority of Boost community members. I also think it
> is best to avoid further open-ended discussion on this topic and to
> actively make the design decisions, as the time period for GSoC is
> limited and the discussion tends to be never-ending.
>
> I think there are several options that I can choose for my project:
> 1. To use Chad Nelson's code as a base, try to incorporate other ideas
> proposed on the mailing list, integrate with Boost.Locale, and bring it
> up to Boost quality to submit for review. If this option is chosen, I
> hope that Chad Nelson can be my mentor.
> 2. To start a new code base, gathering and compiling the ideas
> suggested on the mailing list, with the final design decisions made by
> me and my mentor rather than the community (to keep the project moving
> quickly), then bring it up to Boost quality and submit it for review.
> 3. To start the boost::string project, in which a better string class
> is designed that fixes all the weaknesses of std::string.
> 4. To adopt a different proposal and improve an existing project, such
> as Boost.Unicode [2] or Boost.Locale [3], so that it really solves the
> encoding awareness problem.
> 5. Any other suggestion?
>
> I hope to get feedback from you on what I should really focus on in
> this project. Of course I also hope that this subject is mature enough
> to be accepted as a GSoC project, as I can see great interest in the
> community in solving this problem. I would also like to clarify again
> that I do not intend to solve actual Unicode handling problems in this
> project - there are already excellent libraries such as Boost.Locale
> designed for that. My main objective is to design a set of interfaces
> that help to ensure encoding correctness and consistency when strings
> are passed between different functions. I look forward to hearing from
> anyone who is interested in this project and is willing to be my
> mentor.
>
> Lastly, I apologize for my grammar and for any misunderstanding
> caused by my poor writing. Please do correct me if I have missed
> anything or misunderstood some aspects. I will write a complete and
> formal proposal once I hear feedback from you.
>
> Thank you very much. I hope that I can start contributing to the
> Boost community!
>
> Best Regards,
>
> Soares Chen
> National University of Singapore
>
>
> References:
> [1] The Oak Circle C++ (Unicode) Toolkit, by Chad Nelson.
> http://www.oakcircle.com/toolkit.html
> [2] Boost.Unicode, by Mathias Gaunard.
> http://mathias.gaunard.com/unicode/doc/html/
> [3] Boost.Locale. http://cppcms.sourceforge.net/boost_locale/html/index.html
> [4] Should UTF-16 be considered harmful?, Stack Overflow.
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
> [5] Always treat std::strings as UTF-8?, Boost Mailing List
> Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/13966c1a3d4ceadd/1be0173d252deb62
> [6] What will string handling in C++ look like in the future, Boost
> Mailing List Discussion.
> http://groups.google.com/group/boost-developers-archive/browse_thread/thread/deed8f95125dce02/c6e517b77f403eda
> [7] [string] proposal, Boost Mailing List Discussion.
> http://groups.google.com/group/boost-devel-archive/browse_thread/thread/f8516df28af22c4b/400f2e616de10ef0
>
> (Sorry for linking the mailing list archive to Google Groups, but I
> feel that Google Groups provides a better interface for reading the
> archives for those who haven't read the discussions)
>
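
For readers who have not followed the earlier threads, proposal (1) in the quoted message, and the stated goal of keeping encodings consistent when strings are passed between functions, can be illustrated with a minimal sketch. This is not Chad Nelson's code or naming: encoded_string, utf8_tag, latin1_tag, to_utf8 and utf8_byte_length are all hypothetical names chosen for illustration, and Latin-1 is just an example of a non-UTF-8 encoding.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tag types naming an encoding (illustration only).
struct utf8_tag {};
struct latin1_tag {};

// A thin wrapper that records the encoding of its bytes in the type.
// The constructor is explicit, so a plain std::string never converts silently.
template <typename Encoding>
class encoded_string {
public:
    explicit encoded_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

using utf8_string = encoded_string<utf8_tag>;
using latin1_string = encoded_string<latin1_tag>;

// The only way from Latin-1 to UTF-8 is an explicit, named conversion.
// Each Latin-1 byte is one code point in U+0000..U+00FF.
inline utf8_string to_utf8(const latin1_string& in) {
    std::string out;
    for (unsigned char c : in.bytes()) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII maps to itself
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // 2-byte lead
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return utf8_string(std::move(out));
}

// A library function that states the encoding it expects in its signature.
inline std::size_t utf8_byte_length(const utf8_string& s) {
    return s.bytes().size();
}

// Mixing encodings is rejected at compile time rather than at run time.
static_assert(!std::is_convertible<latin1_string, utf8_string>::value,
              "Latin-1 must not convert to UTF-8 implicitly");
static_assert(!std::is_convertible<std::string, utf8_string>::value,
              "raw std::string must be tagged explicitly");
```

With these definitions, utf8_byte_length(latin1_string("...")) fails to compile, while utf8_byte_length(to_utf8(latin1_string("..."))) is accepted. The sketch does no validation of the bytes it wraps; it only shows how the type system can carry the encoding across function boundaries.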

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk