Subject: Re: [Boost-users] [interprocess] Reading huge files
From: Brian Budge (brian.budge_at_[hidden])
Date: 2013-10-11 13:14:32


On Fri, Oct 11, 2013 at 5:55 AM, Sensei <senseiwa_at_[hidden]> wrote:
> Dear all,
>
> I am new to boost memory mapping, so this question might look simplistic.
>
> I need to read huge amounts of data (for instance, a 20GB file), and since
> memory mapping is quite fast, I was going to use it. However, I don't know
> which approach would be fastest when, due to memory constraints, I need to
> partition the file into regions. Moreover, I need to treat the file as a
> string (I need to perform string operations).
>
> What I'm trying now is just to read the entire file:
>
> boost::interprocess::file_mapping mmap(input_filename.c_str(),
>                                        boost::interprocess::read_only);
> boost::interprocess::mapped_region map(mmap,
>                                        boost::interprocess::read_only);
>
> std::size_t l = map.get_size(), tot_read = 0;
>
> const void *ptr = map.get_address();
>
> while (tot_read < l)
> {
>     // Copy at most prealloc bytes per iteration into line.
>     const std::size_t x = std::min(l - tot_read,
>                                    static_cast<std::size_t>(prealloc));
>
>     std::copy_n(static_cast<const char*>(ptr) + tot_read, x, line.begin());
>
>     // Do something here...
>
>     tot_read += x;
> }
>
>
> So, when the file is huge, do I need to create a mapped_region inside the
> loop? I didn't see anywhere in the documentation the possibility to move the
> mapped region.
>

If you're on a 64-bit system, you can simply mmap the entire file.
There is no need to break the file into regions just because it's huge
:) The OS will page the data in as required. On 32-bit, you do need
to manage regions, because a 20GB mapping would not fit in your address
space. That is kinda crappy: ideally you'd split your regions at EOL
boundaries, but you don't know where those are until you've parsed the
file. In practice, you'd be stuck handling lines that straddle two
regions, but hey, that's the price you pay if you want to run 32-bit
code.
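
To make the 32-bit case a bit more concrete, here is a minimal sketch of
the bookkeeping, assuming fixed-size windows and an unfinished line that
is carried over into the next window. The names Window, plan_windows,
LineCarry and split_lines_windowed are made up for illustration and are
not Boost APIs. The driver walks an in-memory buffer and uses
std::string_view (C++17) so that it compiles without Boost; in real code
each window would be a mapped_region built with an offset and a size, as
the comment in the loop indicates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// One window of the file. The offset is 64-bit so it can exceed 4GB even
// when std::size_t is only 32 bits wide.
struct Window {
    std::uint64_t offset;
    std::size_t size;
};

// Hypothetical helper: cuts [0, file_size) into consecutive windows.
std::vector<Window> plan_windows(std::uint64_t file_size, std::size_t window_size) {
    assert(window_size > 0);
    std::vector<Window> windows;
    for (std::uint64_t off = 0; off < file_size; off += window_size) {
        const std::uint64_t left = file_size - off;
        windows.push_back({off, static_cast<std::size_t>(
                                    std::min<std::uint64_t>(left, window_size))});
    }
    return windows;
}

// Hypothetical helper: emits complete lines and keeps the unfinished tail
// of a window until the next window supplies the rest of that line.
class LineCarry {
public:
    // Appends every line completed inside `window` to `out`.
    void feed(std::string_view window, std::vector<std::string>& out) {
        std::size_t start = 0;
        for (;;) {
            const std::size_t eol = window.find('\n', start);
            if (eol == std::string_view::npos) break;
            carry_.append(window.substr(start, eol - start));
            out.push_back(std::move(carry_));
            carry_.clear();
            start = eol + 1;
        }
        carry_.append(window.substr(start));  // tail without a '\n' yet
    }

    // Call once after the last window: emits a final line that has no '\n'.
    void finish(std::vector<std::string>& out) {
        if (!carry_.empty()) {
            out.push_back(std::move(carry_));
            carry_.clear();
        }
    }

private:
    std::string carry_;
};

// Driver over an in-memory buffer that stands in for the file.
std::vector<std::string> split_lines_windowed(std::string_view data,
                                              std::size_t window_size) {
    std::vector<std::string> lines;
    LineCarry carry;
    for (const Window& w : plan_windows(data.size(), window_size)) {
        // With Boost.Interprocess this view would instead come from
        //   mapped_region region(mapping, read_only, w.offset, w.size);
        // using region.get_address() and region.get_size().
        carry.feed(data.substr(static_cast<std::size_t>(w.offset), w.size), lines);
    }
    carry.finish(lines);
    return lines;
}
```

For simplicity this copies every line into a std::string. Only the lines
that straddle two windows really need a copy, so a tuned version could
hand out views for the rest.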

> Another side-question, if you don't mind. I'm not sure that what I'm doing
> is efficient, especially the need to copy from the region to a string. If
> you have suggestions, I'm more than happy to hear these.

I would use Boost's new string_ref instead of std::string, since it
refers to the mapped bytes without copying them. The obvious
solution would be to use Boost.Tokenizer to break up the giant string
into string_ref lines; however, I'm not sure it supports string_ref yet.
An EOL tokenizer should be only a few lines of code though, and you
could fairly trivially tokenize your string into string_refs.
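
As a rough illustration of how small such an EOL tokenizer can be, here
is a sketch. It is written against std::string_view (C++17) rather than
boost::string_ref so that it builds without Boost; I'm assuming the same
logic carries over to string_ref, which offers a very similar interface.
The function name split_lines is made up, and stripping a trailing '\r'
is an extra assumption to cope with CRLF files.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Splits `text` into lines without copying any characters: every element
// of the result is a view into `text`, which must outlive the result.
std::vector<std::string_view> split_lines(std::string_view text) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t eol = text.find('\n', start);
        if (eol == std::string_view::npos) eol = text.size();  // last line, no '\n'
        std::string_view line = text.substr(start, eol - start);
        // Assumption: treat "\r\n" as a line ending too.
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        lines.push_back(line);
        start = eol + 1;
    }
    return lines;
}
```

With the mapped_region named map from the original post, the input view
could be built as std::string_view(static_cast<const char*>(map.get_address()),
map.get_size()). For a 20GB file the vector of views may itself get large,
so calling a function per line instead of collecting them is worth
considering.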

  Brian

