Hi all,

I recently found a bug in Boost.Regex that can lead to an access violation (AV).

 std::string s(".*?");
 boost::regex regEx(s.begin(), s.end());  // Potential AV here

The basic_regex constructor creates a local variable of type traits::string_type and passes the address of its first element and the one-past-the-last address to the assign() function.

   template <class InputIterator>
   basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
   {
      typedef typename traits::string_type seq_type;
      // Copy the input into a local sequence, then hand raw pointers to assign().
      seq_type a(arg_first, arg_last);
      if(a.size())
         assign(&*a.begin(), &*a.begin() + a.size(), f);
      else
         assign(static_cast<const charT*>(0), static_cast<const charT*>(0), f);
   }

Calling assign() eventually leads to the creation of a basic_regex_parser object and a call to its parse_repeat() function. The m_begin and m_end members are initialized with the values passed to assign().

template <class charT, class traits>
bool basic_regex_parser<charT, traits>::parse_repeat(std::size_t low, std::size_t high)
{
   // ...
      // OK we have a perl or emacs regex, check for a '?':
      if(this->m_traits.syntax_type(*m_position) == regex_constants::syntax_question)
      {
         greedy = false;
         ++m_position;
      }
      // for perl regexes only check for pocessive ++ repeats.
      if((0 == (this->flags() & regbase::main_option_type))
         && (this->m_traits.syntax_type(*m_position) == regex_constants::syntax_plus))
      {
         pocessive = true;
         ++m_position;
      }
   // ...
}

In parse_repeat(), m_position points to '?', so the condition in the first if is true and m_position is advanced by ++m_position. It is now equal to m_end.
The next if statement evaluates *m_position without checking it against m_end, which may cause an access violation.
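
One possible fix, as a minimal sketch only: re-check the position against the end before the second dereference. The standalone function below just mimics the two checks in parse_repeat(). The names parse_suffix and RepeatSuffix and the plain char comparisons are my own stand-ins for illustration, not Boost code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the state that parse_repeat() computes.
struct RepeatSuffix {
    bool greedy;
    bool possessive;
    const char* position;  // where parsing stopped
};

// Mimics the two suffix checks, but never reads *position when position == end.
inline RepeatSuffix parse_suffix(const char* position, const char* end)
{
    assert(position <= end);
    RepeatSuffix r = { true, false, position };
    if (r.position != end && *r.position == '?') {
        r.greedy = false;
        ++r.position;
    }
    // Re-check the bound: the previous branch may have advanced to end.
    if (r.position != end && *r.position == '+') {
        r.possessive = true;
        ++r.position;
    }
    return r;
}
```

With a one-element buffer holding only '?', the second check is skipped instead of reading one past the end.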

Actually, the sample code above is unlikely to cause an AV. That is because seq_type in the basic_regex constructor is basic_string, and basic_string's underlying buffer is usually null-terminated. So m_end points to the null terminator, which is always readable, and no AV can happen.
The code I actually used is slightly different, but the concept is the same:

boost::u32regex regEx = boost::make_u32regex(L".*?");

u32regex and make_u32regex are provided by Boost.Regex's ICU support (boost/regex/icu.hpp). That is the way to support Unicode strings.
In this case seq_type becomes std::vector<int,std::allocator<int> >. Sometimes the underlying buffer of the "a" object was allocated right at the end of a page, so basic_regex_parser::m_end pointed to the start of the next page, which was not allocated. Dereferencing m_position then caused the access violation.
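
To illustrate why the vector case differs from the string case, here is a rough sketch. It assumes every wchar_t is a single code point (true for ".*?"), so it does no UTF-16 decoding, and the helper names to_code_points and last_readable are hypothetical. The point is that the vector holds exactly three elements with no terminator after them, so the last readable element is the '?' itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a wide range into a vector<int>, roughly the way
// seq_type a(arg_first, arg_last) does. No UTF-16 decoding is attempted.
inline std::vector<int> to_code_points(const wchar_t* first, const wchar_t* last)
{
    return std::vector<int>(first, last);
}

// Returns the last element that may legally be read. The address
// &v[0] + v.size() is a valid end pointer but must not be dereferenced.
inline int last_readable(const std::vector<int>& v)
{
    assert(!v.empty());
    return *(&v[0] + v.size() - 1);
}
```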

My environment is Visual C++ 9.0 and Boost 1.42.0 (as far as I can see, the code in Boost 1.46.1 is the same).

By the way, when traits::string_type is basic_string, this code in basic_regex::basic_regex is not portable:
assign(&*a.begin(), &*a.begin() + a.size(), f);
If I remember correctly, the C++03 standard does not guarantee that basic_string's controlled sequence is stored contiguously.
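
A more portable way to get the pointer pair might look like the sketch below. pointer_range is a hypothetical helper of mine, not something Boost provides. It relies on basic_string::data(), which returns a pointer to size() contiguous characters, and on vector's contiguous storage, taking care not to form &v[0] on an empty vector.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers, not Boost code: return [first, last) pointers into a
// sequence without relying on &*begin().

// basic_string: data() points to an array holding the size() characters.
template <class C, class T, class A>
std::pair<const C*, const C*> pointer_range(const std::basic_string<C, T, A>& s)
{
    const C* p = s.data();
    return std::make_pair(p, p + s.size());
}

// vector: storage is contiguous, but &v[0] is only valid when non-empty.
template <class C, class A>
std::pair<const C*, const C*> pointer_range(const std::vector<C, A>& v)
{
    const C* p = v.empty() ? static_cast<const C*>(0) : &v[0];
    return std::make_pair(p, p + v.size());
}
```

The empty case yields a pair of null pointers, which matches what the constructor already passes to assign() for an empty sequence.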

Just in case, here is the call stack when the AV occurred:

boost::basic_regex<int,boost::icu_regex_traits>::assign<int *>+0xa2
boost::basic_regex<int,boost::icu_regex_traits>::basic_regex<int,boost::icu_regex_traits><boost::u16_to_u32_iterator<wchar_t const *,int> >+0xb8
boost::re_detail::do_make_u32regex<wchar_t const *>+0x25