Boost logo

Boost :

Subject: Re: [boost] ACID transactions/flushing/MySQL InnoDB
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-12-10 15:47:53


On Fri, Dec 11, 2009 at 6:02 AM, Stefan Strasser <strasser_at_[hidden]> wrote:
>
> thank you very much for both your answers.
> but they don't solve the mystery, please see below.
>

Let me try one more time. ;)

>
> Am Thursday 10 December 2009 16:40:12 schrieb Dean Michael Berris:
>> > what I find surprising is the performance of MySQL for small
>> > transactions. while my library spends 90% of its time waiting for the
>> > disk to flush a transaction, InnoDB uses 100% CPU and barely uses any
>> > disk.
>>
>> Am I reading this correctly, you're doing something with persistence
>> of tuples to disk, allowing indexed access to this persisted data, and
>
> persistence yes, tuples no. persistent objects.
> in the backend that means: chunks of arbitrary sized data.
> there is some indexing going on, but it isn't relevant to this case because
> small transactions are dumped to the log on commit. the data files are only
> updated from time to time, to avoid additional disk syncs.
>

When you say logged on commit, are you writing to a file that's
already created with enough space, or are you writing to a file which
is "grown" every time you add data to it? Apparently this matters and
this is how InnoDB does its log file management.

What you can try doing is creating a fixed-size file first (initially
maybe 10M) then writing from the start of the file and growing the
file in fixed size increments.

> Am Thursday 10 December 2009 16:56:27 schrieb Jose:
>> This is the setting where you can change the performance vs durability
>> tradeoff:
>>
>> If the value of innodb_flush_log_at_trx_commit is
>
> thanks, I didn't know about this. but the value was 1 by default, so MySQL
> claims to flush for each transactions.
>
> but, including all the SQL parsing, running 10000 small transactions takes
> about 4 seconds on my system.
> while this little code snippet alone takes 6-7 seconds, with much more disk
> usage:
>

What InnoDB does is puts everything that fits in memory up there, and
keeps it there. Small transactions will write to the log file, but not
write all the data directly to disk right away.

>  int fd=open("test.bin",O_RDWR | O_CREAT | O_NOATIME);
>  for(int c=0;c<10000;++c){
>    write(fd,&data,1);
>    fsync(fd);
>  }
>

It looks like you're writing to a file that is created but not set to
contain enough data to hold all the 10k data elements. Try zero'ing
out the file first (or growing it to the desired size) then try again.
What you want to do to increase performance (on Linux at least):

  * Leverage the VFS Cache
  * Allow the OS to keep your file in memory unknown to you
  * Leverage the buffers at the OS level

Because you're using fsync, you're asking the kernel to do it for you
-- and if your file is already in the vfs cache, the chances of fsync
returning quicker is higher due to write caching at the OS level.

> neither fdatasync(), O_WRONLY, nor writing whole sectors instead of just a
> byte make a difference. MySQL also claims to use fsync on the page you
> linked.
>
> do you have an explanation for this?
>

They use fsync on the log file, not the data file which IIRC is still
an mmap'ed region. And because their file is already of a certain size
and is usually cached effectively by the OS, they get better
performance numbers even if they keep using fsync on a descriptor.

>> then allowing someone to embed it in their application. This sounds a
>> lot like an embedded database system and looking at SQLite and BDB
>
> I have looked at SQLite, they seem to use the same method as I do.
> judging by this FAQ entry, with similar results:
> http://www.sqlite.org/faq.html#q19
> (although I get much more than 60 transactions a second. I guess they have
> calculated that from the RPM without actually trying it. fsync() returns
> about 1600 times/s on my old 10 GB disk, presumable due to hardware
> caching.).
>

That's not surprising -- and if you're on Linux you even have vfs
caching going for you if your file is already pre-grown.

>> Another means that they do away with the flush is by using memory
>> mapped regions in Linux/UNIX, and then delegate the flushing to the
>> operating system.
>
> I have implemented that, with an own set of problems, see this:
> http://lists.boost.org/boost-users/2009/12/54449.php
>

Yeah, unfortunately those are platform-specific issues you're running
into. Ultimately you're at the mercy of the filesystem implementation
you're using -- if it's journalling -- or whether you're accessing a
block device directly so that you can implement the caching and
implementation on your own. MySQL's InnoDB supports raw block storage
although I doubt you want to go to that level of control.

>> BTW, I am interested in what you're doing and maybe using a library
>> that tackles the embedded database space in a modern C++ manner.
>> Looking forward to hearing more about what you're working on.
>
> I've posted a description for the Libraries Under Construction page (not yet
> there) in this thread:
>
> http://lists.boost.org/Archives/boost/2009/12/159900.php
>

Sweet, thanks for the link. I'll look forward to progress soon. :)

-- 
Dean Michael Berris
blog.cplusplus-soup.com | twitter.com/mikhailberis
linkedin.com/in/mikhailberis | facebook.com/dean.berris | deanberris.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk