
Has anybody managed to get tokenizer working for wide characters with VC7.1 (boost version 1.31.0)? The following example works fine...

    typedef tokenizer<char_separator<std::string::value_type>,
                      std::string::const_iterator, std::string> MyTokenizer;
    const char_separator<std::string::value_type> sep("a");
    MyTokenizer token(std::string("abacadaeafag"), sep);
    for (MyTokenizer::const_iterator it = token.begin(); it != token.end(); ++it)
    {
        std::cout << *it;
    }

...while the following example produces no output...

    typedef tokenizer<char_separator<std::wstring::value_type>,
                      std::wstring::const_iterator, std::wstring> MyTokenizer;
    const char_separator<std::wstring::value_type> sep(L"a");
    MyTokenizer token(std::wstring(L"abacadaeafag"), sep);
    for (MyTokenizer::const_iterator it = token.begin(); it != token.end(); ++it)
    {
        std::wcout << *it;
    }

Cheers,
Douglas.

tokenizer worked in Unicode for me, so I experimented with your example to try to find out what made the difference. To simplify building in different modes, I changed it to the following:

    // ==== BEGIN CODE ====
    // Unicode Build: cl /D_UNICODE /EHsc /IF:\Dev\boost_1_31_0 tok.cpp
    // DBCS Build: cl /EHsc /IF:\Dev\boost_1_31_0 tok.cpp
    //
    #include <string>
    #include <iostream>
    #include <boost/tokenizer.hpp>

    #ifdef _UNICODE
    typedef std::basic_string<wchar_t> string_t;
    #define _T(x) L##x
    #define STDOUT std::wcout
    #else
    typedef std::basic_string<char> string_t;
    #define _T(x) x
    #define STDOUT std::cout
    #endif

    typedef string_t::value_type char_t;

    typedef boost::tokenizer
    <
        boost::char_separator<char_t>,
        string_t::const_iterator,
        string_t
    >
    MyTokenizer;

    const boost::char_separator<char_t> sep(_T("a"));

    int main()
    {
    #ifdef _BUG
        // Tokenizer built from a temporary string
        MyTokenizer token(string_t(_T("abacadaeafag")), sep);
    #else
        // Tokenizer built from a named string
        string_t s(_T("abacadaeafag"));
        MyTokenizer token(s, sep);
    #endif

        for (MyTokenizer::const_iterator it = token.begin(); it != token.end(); ++it)
            STDOUT << *it;

        return 0;
    }
    // ==== END CODE ====

The following table shows the output depending on whether _UNICODE and _BUG are defined:

    _UNICODE  _BUG   Output
    -----------------------------
    undef     def    " bcdefg"
    def       def    ""
    undef     undef  "bcdefg"
    def       undef  "bcdefg"

It seems that the tokenizer constructor is handling both Unicode and MBCS temporary strings incorrectly, with VC7.1.

Keith MacDonald

I've run the same tests with gcc 3.2.2 on RH9, without any problems, so I'll post this on the microsoft.public.vc.language newsgroup.

Keith MacDonald

On Sat, 21 Feb 2004 09:10:43 -0000, Keith MacDonald wrote:
> MyTokenizer token(string_t(_T("abacadaeafag")), sep);

If you take a look at the tokenizer constructor

    template <typename Container>
    tokenizer(const Container& c, const TokenizerFunc& f)
      : first_(c.begin()), last_(c.end()), f_(f) { }

you will notice that it does not store a copy of its string argument; instead it stores only its begin and end iterators. When the string is destroyed (and in your example it is a temporary, so it is destroyed at the end of the full expression), these iterators are no longer valid. The problem you are experiencing is an unusual manifestation of undefined behaviour: you are working with invalid iterators. I think a program crash would be a better indication that you have a serious problem, but undefined behaviour may manifest in any other way; this time it just looks as if the tokenizer is empty. Of course, a slight change to the program or compilation options may result in a crash (or anything else), until you remove the undefined behaviour:

    string_t s(_T("abacadaeafag"));
    MyTokenizer token(s, sep);
B.
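
The lifetime issue described above can be illustrated without Boost. The following is a minimal sketch, not the library's implementation: SplitView and join_tokens_safely are hypothetical names invented here, and the class only mimics the quoted constructor in that it stores a pair of iterators rather than a copy of the string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a tokenizer. Like the constructor quoted above,
// it keeps only begin/end iterators and never copies the string itself.
class SplitView {
public:
    SplitView(const std::string& s, char sep)
        : first_(s.begin()), last_(s.end()), sep_(sep) {}

    // Walk the stored iterator range and collect the non-empty tokens.
    std::vector<std::string> tokens() const {
        std::vector<std::string> out;
        std::string current;
        for (std::string::const_iterator it = first_; it != last_; ++it) {
            if (*it == sep_) {
                if (!current.empty()) out.push_back(current);
                current.clear();
            } else {
                current += *it;
            }
        }
        if (!current.empty()) out.push_back(current);
        return out;
    }

private:
    std::string::const_iterator first_;
    std::string::const_iterator last_;
    char sep_;
};

// Safe usage: the named string outlives the view that points into it.
std::string join_tokens_safely() {
    const std::string s("abacadaeafag");  // lives until the function returns
    SplitView view(s, 'a');               // iterators into s stay valid

    // Unsafe variant (deliberately left as a comment):
    //   SplitView bad(std::string("abacadaeafag"), 'a');
    // The temporary is destroyed at the end of that statement, so any later
    // call to bad.tokens() would read through dangling iterators.

    const std::vector<std::string> toks = view.tokens();
    assert(toks.size() == 6);             // b, c, d, e, f, g

    std::string joined;
    for (std::size_t i = 0; i < toks.size(); ++i) joined += toks[i];
    return joined;
}
```

Calling join_tokens_safely() returns "bcdefg", matching the output reported for the builds without _BUG. The same reasoning applies unchanged to std::wstring.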

Hmmm. I've been trying to use various members of the boost library as black boxes, but this issue highlights the danger of doing so. I suppose a language keyword is needed to specify when a non-temporary object is required as an actual parameter. Given that there's no such thing, perhaps it would be safer to eliminate such convenience constructors from the library?

Keith MacDonald

On Sun, 22 Feb 2004 20:36:54 -0000, Keith MacDonald wrote:
> Hmmm. I've been trying to use various members of the boost library as black boxes, but this issue highlights the danger of doing so. I suppose a language keyword is needed to specify when a non-temporary object is required as an actual parameter. Given that there's no such thing, perhaps it would be safer to eliminate such convenience constructors from the library?

I think the simplest thing to do would be to explain the problem in the tokenizer documentation.

B.

PS. There is a chance that C++ will be enriched with syntax that allows detecting an rvalue (temporary value) used as a function parameter; see: http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2002/n1377.htm
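
Rvalue references, the feature that the linked proposal describes, were later adopted in C++11. As a sketch of how they can address the concern about temporaries, assuming a C++11 or later compiler, an overload taking an rvalue can be declared as deleted so that passing a temporary fails to compile. RangeHolder and round_trip are hypothetical names used for illustration only; this is not how boost::tokenizer is declared.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical iterator-pair holder that rejects temporaries at compile time.
class RangeHolder {
public:
    // Lvalues bind here; the caller's string outlives the constructor call.
    explicit RangeHolder(const std::string& s)
        : first_(s.begin()), last_(s.end()) {}

    // Rvalues (temporaries) prefer this overload. Because it is deleted,
    // RangeHolder(std::string("abc")) is a compile-time error.
    explicit RangeHolder(const std::string&&) = delete;

    // Copy out the characters in the stored range.
    std::string str() const { return std::string(first_, last_); }

private:
    std::string::const_iterator first_;
    std::string::const_iterator last_;
};

// Compile-time checks of the two cases.
static_assert(std::is_constructible<RangeHolder, const std::string&>::value,
              "named strings are accepted");
static_assert(!std::is_constructible<RangeHolder, std::string>::value,
              "temporary strings are rejected");

// Inside this function s is an lvalue, so the non-deleted overload is used.
std::string round_trip(const std::string& s) {
    RangeHolder holder(s);
    assert(holder.str() == s);
    return holder.str();
}
```

The design choice here is to turn a silent run-time failure into a build error, at the cost of forcing callers to name their strings even in cases where a temporary would have lived long enough.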
participants (3)
- Bronek Kozicki
- Douglas G. Hanley
- Keith MacDonald