First, a few tips that might be useful when running performance tests:

* It's a good idea to enable full compiler optimizations (-O2 or -O3, -march=<your-arch>, and maybe -m64) so that both data structures get a chance to perform at their best.
* It's better to measure only the operation you are actually interested in, i.e. use something like clock_gettime() within the program itself. This keeps other overheads, such as thread creation time, from affecting your results.
* Identify which parameters could affect your results (producer/consumer count, transfer rate, etc.), change strictly one parameter at a time, and run the full test suite after each change.
* Run the same test multiple times to make sure the results are stable.
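
A minimal sketch of the in-program timing and repeated runs suggested above, assuming a POSIX system where clock_gettime() and CLOCK_MONOTONIC are available. The helper names now_ns and time_runs are made up for illustration and are not part of Boost or of the posted test programs.

```cpp
#include <time.h>      // clock_gettime, CLOCK_MONOTONIC (POSIX)
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Monotonic timestamp in nanoseconds.
inline std::int64_t now_ns() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Runs `op` `repeats` times, timing each run separately.
// Returns the elapsed nanoseconds of every run, sorted ascending,
// so the minimum, median, and maximum are easy to read off.
template <typename Op>
std::vector<std::int64_t> time_runs(Op op, int repeats) {
    assert(repeats > 0);
    std::vector<std::int64_t> samples;
    samples.reserve(repeats);
    for (int r = 0; r < repeats; ++r) {
        const std::int64_t start = now_ns();
        op();                                  // only this call is timed
        samples.push_back(now_ns() - start);
    }
    std::sort(samples.begin(), samples.end());
    return samples;
}
```

If the threads are created before now_ns() is first called, thread creation time stays out of the measurement. A large gap between samples.front() and samples.back() is a hint that the test is not stable yet. On older glibc versions, clock_gettime() may require linking with -lrt.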

Based on experience with some Haswell machines (with cores isolated and threads pinned):
* With one producer and one consumer (low contention) running on the same NUMA node, even at high transfer rates (~50k transfers/sec), spin-locked queues perform the same as boost::lockfree::queue (~350 ns transfer latency), and boost::lockfree::spsc_queue performs 2x better.
* When a NUMA node boundary is crossed at low transfer rates (~1k transfers/sec), the spin-locked queue performs 3x better than boost::lockfree::queue and 2x better than boost::lockfree::spsc_queue.

But the real benefit of lockfree becomes visible when the number of producers and consumers is increased and the transfer rate is increased (i.e. very high contention); even across NUMA nodes, lockfree beats the spin-locked queue by as much as 6x.

It's difficult to say which of these factors were affecting you, but the best way to find out is to change the parameters and test for yourself.

Thanks
SampathT

On Sun, Jul 10, 2016 at 7:52 PM, Leon Mlakar <leon@digiverse.si> wrote:
On 10.07.2016 15:56, gao1738@sina.com wrote:
Thanks for the response!

What I want to know is:
1. When should I use boost::lockfree?

2. What is the main use of boost::lockfree, if not better performance?

3. What is the best practice for boost::lockfree?

4. Why is it that the better the machine is, the worse boost::lockfree performs?

In my test 3, on the 32-core machine, I tested with 1 to 32 threads.
In all of the test cases, boost::lockfree was much slower than the mutex. Why?

A couple of years ago I was playing around with comparing lock-free vs. mutex-guarded (C++11 std::mutex) structures and found that while on OS X, and particularly on MS Windows, lock-free structures may be significantly faster than mutex-guarded ones (up to 20x), there was virtually no difference on Linux. On Linux, the lock-free structures gained somewhat over mutexes only under high contention.

I guess part of the answer is that Linux uses user-space mutexes (futex) that resort to a system call only when there is contention. Still, the lock-free variants (it was not boost::lockfree) were never slower than their mutex-based counterparts.

As for "the better the machine, the worse the performance": it might be due to the inherent costs of memory barriers, which get more pronounced with faster hardware and more advanced CPUs.

Cheers,

Leon





Best regards,

dennis


----- Original Message -----
From: james <dirtydroog@gmail.com>
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] the performance of boost::lock_free is slow in centos 6 and 7
Date: July 8, 2016, 21:21

I suspect the main issue with your test is the effects of false sharing being magnified when the number of cores is >= the number of threads.
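
To illustrate the false-sharing point: in the posted lock_free_test.cc, producer_count, consumer_count, and done are small globals defined close together, so they may well share a cache line (an assumption; the actual layout is up to the compiler and linker). A minimal sketch of giving each hot counter its own line, written with C++11 std::atomic rather than boost::atomic. The 64-byte line size and the names below are assumptions for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size. 64 bytes is typical for x86-64 CPUs,
// but this is an assumption, not something queried from the hardware.
constexpr std::size_t kCacheLine = 64;

// A counter that occupies a whole cache line by itself, so that
// updates to one counter do not invalidate the line holding another.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<int> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "counter should fill exactly one cache line");
static_assert(alignof(PaddedCounter) == kCacheLine,
              "counter should start on a cache-line boundary");

// Hypothetical replacements for the two global counters in the test.
PaddedCounter padded_producer_count;
PaddedCounter padded_consumer_count;
```

Producers would then increment padded_producer_count.value and consumers padded_consumer_count.value. Whether this changes the measured times on a given machine is something only a re-run of the test can show.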

I'd also suggest using the _mm_pause intrinsic when busy spinning if your CPU supports it. (It's a real shame there's no official spinlock class in C++.)
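
A rough sketch of what such a spinlock could look like, using C++11 std::atomic_flag and calling _mm_pause while spinning. The names SpinLock and cpu_relax are made up for illustration; on non-x86-64 targets the sketch falls back to std::this_thread::yield(), which is a stand-in and not equivalent to a pause instruction.

```cpp
#include <atomic>
#include <mutex>    // std::lock_guard
#include <thread>   // std::this_thread::yield

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
// Hint to the CPU that this is a spin-wait loop.
inline void cpu_relax() { _mm_pause(); }
#else
// Fallback for other targets (an assumption, not a true pause).
inline void cpu_relax() { std::this_thread::yield(); }
#endif

// Minimal test-and-set spinlock. It provides lock() and unlock(),
// so it can be used with std::lock_guard.
class SpinLock {
public:
    void lock() {
        // Spin until the flag was previously clear, i.e. we took the lock.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            cpu_relax();
        }
    }

    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

The same cpu_relax() call could also go into the empty retry loop `while (!queue.push(value)) ;` of the posted producer, so that a spinning producer is a little gentler on the core it shares.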

Also, use something like a countdown latch to make sure your threads all start the actual work at the same time.
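
A minimal sketch of such a countdown latch in C++11, built from std::mutex and std::condition_variable (C++20 offers std::latch for the same purpose). CountdownLatch and run_synchronized are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// One-shot start gate: each thread calls arrive_and_wait() and is
// released only once `count` threads have arrived.
class CountdownLatch {
public:
    explicit CountdownLatch(int count) : count_(count) { assert(count > 0); }

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lk(m_);
        assert(count_ > 0);            // more arrivals than expected
        if (--count_ == 0) {
            cv_.notify_all();          // last arrival releases everyone
        } else {
            cv_.wait(lk, [this] { return count_ == 0; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Starts `n` threads and lets them begin `work` only after all of
// them exist, so thread start-up skew is kept out of the measurement.
// `work` is called concurrently and must be safe to call that way.
template <typename Work>
void run_synchronized(int n, Work work) {
    CountdownLatch start(n);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&start, &work] {
            start.arrive_and_wait();
            work();
        });
    }
    for (auto& t : threads) t.join();
}
```

In the posted test, the producer and consumer bodies would each wait on one shared latch sized for all eight threads before entering their loops.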


On Fri, Jul 8, 2016 at 1:50 PM, Michael <mwpowellhtx@gmail.com> wrote:


On July 8, 2016 1:39:13 AM EDT, gao1738@sina.com wrote:
> Hi all,
>
>I tried boost::lockfree::queue and found a performance issue:
>
>I use the following test programs:
>
>lock_free_test.cc
>
>#include <boost/thread/thread.hpp>
>#include <boost/lockfree/queue.hpp>
>#include <iostream>
>#include<cstdio>
>
>#include <boost/atomic.hpp>
>
>boost::atomic_int producer_count(0);
>boost::atomic_int consumer_count(0);
>
>boost::lockfree::queue<int> queue(128);
>
>const int iterations = 1000000;
>const int producer_thread_count = 4;
>const int consumer_thread_count = 4;
>
>void producer(void)
>{
>  for (int i = 0; i != iterations; ++i) {
>    int value = ++producer_count;
>    while (!queue.push(value))
>      ;
>  }
>}
>
>boost::atomic<bool> done (false);
>void consumer(void)
>{
>  int value;
>  while (!done) {
>    while (queue.pop(value))
>      ++consumer_count;
>  }
>
>  while (queue.pop(value))
>    ++consumer_count;
>}
>
>int main(int argc, char* argv[])
>{
>  using namespace std;
>  cout << "boost::lockfree::queue is ";
>  if (!queue.is_lock_free())
>    cout << "not ";
>  cout << "lockfree" << endl;
>
>  boost::thread_group producer_threads, consumer_threads;  // thread groups
>
>  for (int i = 0; i != producer_thread_count; ++i)
>    producer_threads.create_thread(producer);
>
>  for (int i = 0; i != consumer_thread_count; ++i)
>    consumer_threads.create_thread(consumer);
>
>  producer_threads.join_all();
>  done = true;
>
>  consumer_threads.join_all();
>
>  cout << "produced " << producer_count << " objects." << endl;
>  cout << "consumed " << consumer_count << " objects." << endl;
>}
>
>
>lock_test.cc
>
>#include <boost/thread/thread.hpp>
>#include <boost/lockfree/queue.hpp>
>#include <iostream>
>#include<cstdio>
>#include <queue>
>
>#include <boost/atomic.hpp>
>using namespace std;
>
>boost::mutex producer_count_mu;
>boost::mutex consumer_count_mu;
>int producer_count = 0;
>int consumer_count = 0;
>
>std::queue<int> message_queue;
>
>boost::mutex queue_mutex;
>
>const int iterations = 1000000;
>const int producer_thread_count = 4;
>const int consumer_thread_count = 4;
>
>void producer(void)
>{
>  for (int i = 0; i != iterations; ++i) {
>    queue_mutex.lock();
>    int value = ++producer_count;
>    message_queue.push(value);
>    queue_mutex.unlock();
>  }
>}

I haven't used lockfree per se, but my understanding is that it does what its name says.

My guess is that most of the time is spent contending for the mutex. Incidentally, why not use one of the proper lock classes? You are already using Boost, so they are available too. That would save you from having to lock and unlock manually, at least.

I haven't explored lockfree that much and could be wrong, but I thought the whole point of going lock-free was to avoid expensive locks, not to absolve you of handling the exhausted condition when your queue is empty.

Also, with a test like this, what are you really asserting? Lock free is not expense free. There are no free lunches, now less than ever.
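
A small sketch of what using a scoped lock class might look like for the mutex-guarded queue. It uses C++11 std::mutex and std::lock_guard so that it is self-contained; boost::mutex with boost::lock_guard follows the same pattern. The names guarded_queue, push_guarded, and try_pop_guarded are made up for illustration.

```cpp
#include <mutex>
#include <queue>

// Hypothetical stand-ins for queue_mutex and message_queue in the test.
std::mutex guarded_queue_mutex;
std::queue<int> guarded_queue;

// The guard locks in its constructor and unlocks in its destructor,
// so the mutex is released on every exit path, including exceptions.
void push_guarded(int value) {
    std::lock_guard<std::mutex> guard(guarded_queue_mutex);
    guarded_queue.push(value);
}

// Pops one element if available. Returns false when the queue is empty.
bool try_pop_guarded(int& value) {
    std::lock_guard<std::mutex> guard(guarded_queue_mutex);
    if (guarded_queue.empty()) {
        return false;
    }
    value = guarded_queue.front();
    guarded_queue.pop();
    return true;
}
```

This does not reduce contention by itself; it only removes the manual lock()/unlock() pairs and the risk of leaving the mutex locked.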

Anyhow, HTH

Regards,

Michael Powell

>bool done (false);
>void consumer(void)
>{
>  int value;
>  while (!done) {
>    queue_mutex.lock();
>    while (!message_queue.empty()) {
>      message_queue.pop();
>      ++consumer_count;
>    }
>    queue_mutex.unlock();
>  }
>
>  queue_mutex.lock();
>  while (!message_queue.empty()) {
>    message_queue.pop();
>    ++consumer_count;
>  }
>  queue_mutex.unlock();
>}
>int main(int argc, char* argv[])
>{
>  using namespace std;
>  cout << "boost::lockfree::queue is ";
>//  if (!queue.is_lock_free())
>    cout << "not ";
>  cout << "lockfree" << endl;
>
>  boost::thread_group producer_threads, consumer_threads;  // thread groups
>
>  for (int i = 0; i != producer_thread_count; ++i)
>    producer_threads.create_thread(producer);
>
>  for (int i = 0; i != consumer_thread_count; ++i)
>    consumer_threads.create_thread(consumer);
>
>  producer_threads.join_all();
>  done = true;
>
>  consumer_threads.join_all();
>
>  cout << "produced " << producer_count << " objects." << endl;
>  cout << "consumed " << consumer_count << " objects." << endl;
>}
>
>
>The compile commands are:
>g++ -I/usr/local/include -L/usr/local/lib lock_free_test.cc -lboost_thread -lboost_system -o lock_free_test
>g++ -I/usr/local/include -L/usr/local/lib lock_test.cc -lboost_thread -lboost_system -o lock_test
>
>1. I first tested it on my work computer, which runs Ubuntu 14.04 on a 2-core i5, with
>boost version: 1.54
>gcc version: 4.8.4
>g++ version: 4.8.4
>
>The test result is:
>
>time ./lock_test
>boost::lockfree::queue is not lockfree
>produced 4000000 objects.
>consumed 4000000 objects.
>
>real    0m3.844s
>user    0m1.800s
>sys    0m12.308s
>
>time ./lock_free_test
>boost::lockfree::queue is lockfree
>produced 4000000 objects.
>consumed 4000000 objects.
>
>real    0m1.745s
>user    0m6.886s
>sys    0m0.000s
>
>
>We can see that the lock-free solution performs better, taking about 50% less
>wall-clock time.
>
>2. Then I tested it on a PC server with CentOS 6.4 and 8 cores
>(Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz)
>
>boost version: 1.54
>gcc version: 4.4.7
>g++ version: 4.4.7
>
>The test result is:
>
>time ./lock_test
>boost::lockfree::queue is not lockfree
>produced 4000000 objects.
>consumed 4000000 objects.
>
>real    0m3.900s
>user    0m2.593s
>sys    0m27.282s
>
> time ./lock_free_test
>boost::lockfree::queue is lockfree
>produced 4000000 objects.
>consumed 4000000 objects.
>
>real    0m5.470s
>user    0m43.105s
>sys    0m0.000s
>
>
>Here the non-lock-free solution is faster than the lock-free solution.
>
>3. I tested it on a better PC server with CentOS 7.1 and a 32-core CPU
>(Intel(R) Xeon(R) CPU E7-4820 v2 @ 2.00GHz)
>boost version: 1.53
>gcc version: 4.8.3
>g++ version: 4.8.3
>
>time ./lock_test
>boost::lockfree::queue is not lockfree
>produced 4000000 objects.
>consumed 4000000 objects.
>
>real    0m3.023s
>user    0m1.929s
>sys    0m20.706s
>
>time ./lock_free_test
>boost::lockfree::queue is lockfree
>produced 4000000 objects.
>consumed 4000000 objects.
>
>real    0m9.804s
>user    1m14.900s
>sys    0m0.100s
>
>The lock-free solution is about 3 times slower than the non-lock-free
>solution!
>
>
>My questions are:
>1. Why does the lock-free solution perform better on Ubuntu but much
>worse on CentOS 6 and 7?
>    Is it an issue with the kernel, the gcc version, or the Boost version?
>    Does the lock-free solution perform worse the more CPUs the machine has?
>
>2. In which cases should we use the Boost lock-free solution to get
>better performance?
>
>
>
>Best Regards!
>
>dennis
>
>
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Boost-users mailing list
>Boost-users@lists.boost.org
>http://lists.boost.org/mailman/listinfo.cgi/boost-users

--
Thank you.
Sampath Tilakumara