[hub] Porting and pre-review experiments
Hi, Since the review was announced, I've doing some experiments with te proposed data structure. I've collected some notes for technical discussion. Porting and benchmarking -------------------------- I took the hub proposal and did a porting (the initial version was AI-assisted, a very good starting point) that I tuned until the result was good enough for a prototype. It's called "nest" (I could not find a fancy name starting with "h", sorry) here: https://github.com/boostorg/container/blob/develop/include/boost/container/e... with a bench (based on the one provided by hub) here: https://github.com/boostorg/container/blob/develop/experimental/bench_hub.cp... The porting supports all the range of compilers Boost.Container is tested and uses internal Boost.Container utilities. So the first conclusion is that "hub" is perfctly integrable into Boost.Container at the same level as other existing containers with a low effort. Porting performance ------------------- In my local benchmarks the ported container achieves similar performance results, although there are some outliers with some tools (e.g. clang) that I haven't investigated enough. Also, I only tested this on x86/x64, Windows/Linux in my desktop machine, so probably there are rough edges here and there. Benchmarking is certainly magic, I tried to use recommended "clobber()"-like instructions before benchmarked operations, but probably many boost authors would measure this much better than me. Data structure configuration options ----------------------------------- I also added to "nest" configurability options in order to test other performance decisions Joaquín has made: - *prefetch< boolean >* --> Activates prefetching for the nest container to see if prefetching improves benchmark results - *stored_data_in_block< boolean >* --> hub stores 64 values of data independently from the "block" (metadata with the mask and the linked list pointers). When "true", it stores values in the same struct as the metadata (so it allocates metadata + data in the same block) //An example of a modified configuration nest < element //element type , void //default allocator , nest_options_t< prefetch<false>, store_data_in_block<true> >
;
Supporting these options is not complicated, so they could probably be added to "hub" if it's accepted and we can measure some difference on some platforms. ---------------------------------------- Prefetch/stored_data_in_block performance difference ---------------------------------------- Using the above options, it's difficult to see if prefetching or storing data on the block noticeably improves performance in my computer. I see contradictory results depending on the OS (Linux/Windows) or compiler (MSVC/Clang-CL, GCC) I theory I would expect storing data and metadata in the same allocation would help a bit, but I could not have a clear answer. I imagine that results could differ on a deeply embedded machine versus a desktop/server CPU. My conclusion is that the defaults chosen by Joaquín are correct, at least for usual Server/Desktop machines. -------------- Prefetch changes ----------------- I experimented with some prefetch changes in iterator's operator++ https://github.com/boostorg/container/blob/develop/include/boost/container/e... and operator--: https://github.com/boostorg/container/blob/develop/include/boost/container/e... Basically prefetching the first slot with a value in the array for operator++ or the last one in operator-- instead of always prefetching the start of the array. I only see a slight improvement (or maybe it's a place effect). -------------- Local/segmented iterators -------------- I tried to implement efficient segment https://github.com/boostorg/container/blob/develop/include/boost/container/e... and local_iterators https://github.com/boostorg/container/blob/develop/include/boost/container/e... in order to test if hub can take advantage of generic segmented algorithms. The problem I found with my implementation is that the local iterator is nearly as expensive as the general iterator, because it must navigate through the data array using nearly the same bit-tricks. I managed to make the difference between local_iterators (operator-) constant-time (basically counting the number of 1s in the mask between both positions) and that could have a slight advantage if the algorithm does some size-bounded computation (say, algorithms like count_n, fill_n, ...): https://github.com/boostorg/container/blob/develop/include/boost/container/e... In my implementation it's also more efficient to only support positive differences (in the implementation the general implementation supporting negative differences is present but disabled), which is the usual case for iterator ranges in algorithms. Some algorithms I tested could improve a bit with difference operation, but the segmented approach was still more inneficient than using general iterators. Probably Joaquín can improve it and make it sufficiently performant. Hub's iterators are bidirectional but it's possible to make the local_iterator random-access (the needed operations can be made O(1)). However, the performance of my implementation was not good enough IMHO, so probably not very practical. You have the prototype here: https://github.com/boostorg/container/blob/develop/include/boost/container/e... -------------- Summary -------------- - hub's defaults seem correct and performant - I found hub is faster than the plf_hive, as Joaquín claimed. - It's possible to fully integrate hub at the same level as other containers, with moderate effort. A prototype (nest) is already there with similar performance. - I enjoyed reviewing the implementation, more details about my design review in the official review I plan to make. Best, Ion
participants (1)
-
Ion Gaztañaga