
Boost Users :

Subject: Re: [Boost-users] regex_replace and Unicode ( Cyrillic ) problem
From: Viatcheslav.Sysoltsev_at_[hidden]
Date: 2012-04-02 04:10:31


On Sat, 31 Mar 2012 11:05:37 +0200, valery O <egersheldster_at_[hidden]>
wrote:

> Hi,
>
> I have a string with 2 Cyrillic words новый дом repeated 3 times.
> I try to regex_replace each occurrence of these words.
> For this I use std::string format("$1 красный $2")
> and regex pattern ("(\\W+)\\s+(\\W+)").
>
> The result is that only the last occurrence is replaced; the 2 preceding
> ones are not.
>
> Where is my mistake?
> My code:
>
> #include <iostream>
> #include <string>
> #include "boost/regex.hpp"
> using namespace std;
>
> int main(int argc, const char** argv)
> {
> std::string str( "новый дом, новый дом новый дом" );
> regex regx("(\\W+)\\s+(\\W+)");
> std::string format( "$1 красный $2");
> cout<<"regex_replace :"<<regex_replace( str, regx, format );
> return 0;
> }
>

Hi Valery,

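First, your pattern should be:
(\\w+)\\s+(\\w+)
Note the lowercase \w: it matches word characters, whereas uppercase \W
matches everything that is not a word character.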
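To see the difference between the two character classes in isolation, here is a minimal sketch. It assumes std::regex from the standard library as a stand-in for Boost.Regex, and ASCII text instead of Cyrillic, so that it does not depend on any particular locale being installed. The helper name insert_red is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: inserts "red" between the two captured groups of
// every match of `pattern` in `text`.
// Assumption: std::regex stands in for boost::regex here.
std::string insert_red(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    return std::regex_replace(text, re, "$1 red $2");
}

// Example calls (ASCII input):
//   insert_red("new house, new house new house", "(\\w+)\\s+(\\w+)")
//   insert_red("new house, new house new house", "(\\W+)\\s+(\\W+)")
```

With the lowercase pattern, every "new house" pair becomes "new red house". With the uppercase pattern, nothing in this ASCII input matches, because there is no run of non-word characters long enough to satisfy \W+, \s+ and \W+ in sequence, so the text comes back unchanged.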

Second, you are probably not setting the locale for the regex properly. I
do not have a machine with a Russian system locale at hand to check the
default behavior, but I succeeded using basic_regex::imbue():

#include <iostream>
#include <string>
#include "boost/regex.hpp"
#include <locale>
using namespace std;

int main(int argc, const char** argv)
{
    std::string str( "новый дом, новый дом новый дом" );

    boost::regex regx;
    // Set the locale first; the locale name "russian" is platform-specific.
    regx.imbue(std::locale("russian"));
    // Assign the pattern only after imbue().
    regx.assign("(\\w+)\\s+(\\w+)");
    std::cout << "Search string: " << str << ", pattern: " << regx.str() << std::endl;
    std::string format( "$1 красный $2");
    cout << "regex_replace: " << regex_replace( str, regx, format ) << std::endl;
    return 0;
}

gives:
Search string: новый дом, новый дом новый дом, pattern: (\w+)\s+(\w+)
regex_replace: новый красный дом, новый красный дом новый красный дом

Note that you must assign the pattern after the imbue() call: calling
imbue() on a regex that already holds a pattern invalidates that pattern.
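
The ordering rule can be sketched as follows. This is only an illustration: it assumes std::regex, whose imbue() is specified in a similar way (after the call the regex matches nothing until a pattern is assigned again), and it uses std::locale::classic() so that it runs without a Russian locale installed. Both function names are made up for this example.

```cpp
#include <locale>
#include <regex>
#include <string>

// Correct order: imbue the locale first, then assign the pattern.
// Assumption: std::regex stands in for boost::regex here.
std::regex build_in_order(const std::locale& loc, const std::string& pattern)
{
    std::regex re;
    re.imbue(loc);
    re.assign(pattern);
    return re;
}

// Wrong order: the pattern compiled before imbue() is discarded by the call.
std::regex build_out_of_order(const std::locale& loc, const std::string& pattern)
{
    std::regex re(pattern);
    re.imbue(loc);
    return re;
}
```

Searching "abc def" with the regex returned by build_in_order should succeed, while the one returned by build_out_of_order should match nothing until assign() is called on it again.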

-- Slava

