Boost Users :

From: John Maddock (john_maddock_at_[hidden])
Date: 2002-06-30 06:20:32

Next message: John Maddock: "Re: [Boost-Users] Regex boost_1_28_0 incorrect build / run-time problem"
Previous message: Darin Adler: "Re: [Boost-Users] Problems getting subexpressions in regex_search"
In reply to: bthiesfield: "regex with double byte character sets"
Next in thread: Eric Niebler: "Re: regex with double byte character sets"
Reply: Eric Niebler: "Re: regex with double byte character sets"

> I am currently trying to use the boost regex library with Japanese
> language strings. It appears that DBCS is not supported. For
> example, using the following code (with compile definition of
> BOOST_REGEX_USE_C_LOCALE) I get the output strings as
>
> 0 = "。"
> 1 = "English"
>
> Instead of the expected:
>
> 0 = "やゆよわをー。"
> 1 = "English"
>
> This is due to the fact that the Japanese (SJIS encoding) for one of
> these characters uses the [ character as one of the characters in the
> encoding.
>
> setlocale( LC_COLLATE, "Japanese" );
> setlocale( LC_CTYPE, "Japanese" );
>
> const char * pszText = "やゆよわをー。 [english]";
> const char * pszRule = "([^\\[]*)\\[([[:word:]]*)\\]";
>
> // split the string into its components
> std::vector<std::string> vPart;
> std::string sText( pszText ); // regex_split takes a non-const std::string&
> boost::regex eParseExpr( pszRule,
>     boost::regbase::normal | boost::regbase::icase );
> boost::regex_split( std::back_inserter(vPart),
>     sText, eParseExpr );
>
> Is there some way to modify the library to enable DBCS? For
> example, can the char_traits be modified to enable DBCS processing?
> (Keeping in mind that the biggest problem with DBCS is that a single
> character may consist of 2 bytes which tends to blow out all
> assumptions about the size of characters).
>
> Brodie.

To be honest I know nothing at all about DBCS, but I assumed that every code
point was represented by *exactly two* characters. If that's the case then
I think it might be possible: one would have to create a new data type,
something like:

struct DBCS_proxy
{
    char bytes[2];
};

then create a traits class for DBCS_proxy, and cast all char* strings to
DBCS_proxy* when calling the regex functions. Really I'm just thinking
out loud here; I haven't tried it and I don't know if it would work.
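
One caveat on the fixed-width idea: Shift-JIS mixes one-byte characters
(ASCII, half-width katakana) with two-byte ones, so a plain cast to a
two-byte struct would probably misalign on text like "... [english]". A
rough sketch of a variation is to widen the bytes first, so that every
character becomes exactly one 16-bit unit. The names sjis_unit,
is_sjis_lead and widen_sjis below are made up for illustration and are not
part of Boost; the regex traits class for the widened type is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical fixed-width unit: one value per character, whether the
// Shift-JIS source used one byte or two for it.
typedef std::uint16_t sjis_unit;

// True for bytes that start a two-byte Shift-JIS character
// (lead-byte ranges 0x81-0x9F and 0xE0-0xFC).
inline bool is_sjis_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Widen a Shift-JIS byte string so that each character occupies one unit.
// Two-byte characters are packed as (lead << 8) | trail; single-byte
// characters keep their byte value. A dangling lead byte at the end of
// the input is copied through unchanged.
inline std::vector<sjis_unit> widen_sjis(const std::string& bytes)
{
    std::vector<sjis_unit> out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (is_sjis_lead(b) && i + 1 < bytes.size()) {
            unsigned char trail = static_cast<unsigned char>(bytes[++i]);
            out.push_back(static_cast<sjis_unit>((b << 8) | trail));
        } else {
            out.push_back(b);
        }
    }
    return out;
}
```

After widening, the trail byte 0x5B of the Shift-JIS character 0x815B is
folded into the single unit 0x815B, so it no longer compares equal to a
literal '[' (0x005B).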

Otherwise can you use Unicode?
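
As a rough illustration of the Unicode route, the sketch below applies the
same kind of rule to a wide string. It assumes the text has already been
converted to wide characters (for example with std::mbstowcs under a
suitable locale), uses std::wregex purely as a standard-library stand-in
for a wide-character regex type, and writes \w in place of [[:word:]].
The helper name split_label is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: split "<text>[<word>]" into its two parts using
// wide-character matching. Returns false if the input does not match.
inline bool split_label(const std::wstring& text,
                        std::wstring& before, std::wstring& word)
{
    // Same shape as the rule in the question: everything up to the first
    // '[', then a bracketed run of word characters.
    static const std::wregex rule(L"([^\\[]*)\\[(\\w*)\\]");
    std::wsmatch m;
    if (!std::regex_search(text, m, rule))
        return false;
    before = m[1].str();
    word = m[2].str();
    return true;
}
```

Because each Japanese character is a single wchar_t here, no element of
the string can be mistaken for '[' unless it really is one.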

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm

Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net