[hub] Porting and pre-review experiments
Hi,

Since the review was announced, I've been doing some experiments with the proposed data structure. I've collected some notes for technical discussion.

Porting and benchmarking
------------------------

I took the hub proposal and ported it (the initial version was AI-assisted, a very good starting point), then tuned it until the result was good enough for a prototype. It's called "nest" (I could not find a fancy name starting with "h", sorry), here:

https://github.com/boostorg/container/blob/develop/include/boost/container/e...

with a benchmark (based on the one provided by hub) here:

https://github.com/boostorg/container/blob/develop/experimental/bench_hub.cp...

The port supports the full range of compilers Boost.Container is tested on, and uses internal Boost.Container utilities. So the first conclusion is that "hub" is perfectly integrable into Boost.Container, at the same level as the other existing containers, with low effort.

Porting performance
-------------------

In my local benchmarks the ported container achieves similar performance results, although there are some outliers with some tools (e.g. Clang) that I haven't investigated enough. Also, I only tested this on x86/x64, Windows/Linux, on my desktop machine, so there are probably rough edges here and there. Benchmarking is certainly magic; I tried to use the recommended "clobber()"-like instructions before benchmarked operations, but many Boost authors would probably measure this much better than me.

Data structure configuration options
------------------------------------

I also added configurability options to "nest" in order to test other performance decisions Joaquín has made:

- *prefetch< boolean >* --> activates prefetching for the nest container, to see if prefetching improves benchmark results.
- *stored_data_in_block< boolean >* --> hub stores the 64 values of data separately from the "block" (the metadata with the mask and the linked-list pointers).
  When "true", it stores the values in the same struct as the metadata (so metadata + data are allocated in a single block).

   // An example of a modified configuration
   nest< element   // element type
       , void      // default allocator
       , nest_options_t< prefetch<false>, store_data_in_block<true> >
       >;
Supporting these options is not complicated, so they could probably be added to "hub" if it's accepted, and we could measure whether they make a difference on some platforms.

----------------------------------------
Prefetch/stored_data_in_block performance difference
----------------------------------------

Using the above options, it's difficult to see whether prefetching or storing the data in the block noticeably improves performance on my computer. I see contradictory results depending on the OS (Linux/Windows) and compiler (MSVC/Clang-CL, GCC). In theory I would expect that storing data and metadata in the same allocation would help a bit, but I could not get a clear answer. I imagine that results could differ on a deeply embedded machine versus a desktop/server CPU. My conclusion is that the defaults chosen by Joaquín are correct, at least for the usual server/desktop machines.

--------------
Prefetch changes
-----------------

I experimented with some prefetch changes in the iterator's operator++

https://github.com/boostorg/container/blob/develop/include/boost/container/e...

and operator--:

https://github.com/boostorg/container/blob/develop/include/boost/container/e...

Basically, prefetching the first slot with a value in the array for operator++, or the last one for operator--, instead of always prefetching the start of the array. I only see a slight improvement (or maybe it's a placebo effect).

--------------
Local/segmented iterators
--------------

I tried to implement efficient segment

https://github.com/boostorg/container/blob/develop/include/boost/container/e...

and local_iterators

https://github.com/boostorg/container/blob/develop/include/boost/container/e...

in order to test whether hub can take advantage of generic segmented algorithms. The problem I found with my implementation is that the local iterator is nearly as expensive as the general iterator, because it must navigate through the data array using nearly the same bit-tricks.
I managed to make the difference between local_iterators (operator-) constant-time (basically counting the number of 1s in the mask between both positions), and that could be a slight advantage if the algorithm does some size-bounded computation (say, algorithms like count_n, fill_n, ...):

https://github.com/boostorg/container/blob/develop/include/boost/container/e...

In my implementation it's also more efficient to only support positive differences (the general implementation supporting negative differences is present but disabled), which is the usual case for iterator ranges in algorithms. Some algorithms I tested improved a bit with the difference operation, but the segmented approach was still less efficient than using general iterators. Probably Joaquín can improve it and make it sufficiently performant.

Hub's iterators are bidirectional, but it's possible to make the local_iterator random-access (the needed operations can be made O(1)). However, the performance of my implementation was not good enough IMHO, so it's probably not very practical. You have the prototype here:

https://github.com/boostorg/container/blob/develop/include/boost/container/e...

--------------
Summary
--------------

- hub's defaults seem correct and performant.
- I found hub is faster than plf::hive, as Joaquín claimed.
- It's possible to fully integrate hub at the same level as the other containers, with moderate effort. A prototype (nest) is already there, with similar performance.
- I enjoyed reviewing the implementation; more details about my design review in the official review I plan to submit.

Best,

Ion
On 20/04/2026 at 23:15, Ion Gaztañaga wrote:
Hi,
Since the review was announced, I've been doing some experiments with the proposed data structure. I've collected some notes for technical discussion.
[...]
----------------------------------------
Prefetch/stored_data_in_block performance difference
----------------------------------------
Using the above options, it's difficult to see whether prefetching or storing the data in the block noticeably improves performance on my computer. I see contradictory results depending on the OS (Linux/Windows) and compiler (MSVC/Clang-CL, GCC).
In theory I would expect that storing data and metadata in the same allocation would help a bit, but I could not get a clear answer. I imagine that results could differ on a deeply embedded machine versus a desktop/server CPU.
My conclusion is that the defaults chosen by Joaquín are correct, at least for the usual server/desktop machines.
--------------
Prefetch changes
-----------------
I experimented with some prefetch changes in iterator's operator++
https://github.com/boostorg/container/blob/develop/include/boost/container/e...
and operator--:
https://github.com/boostorg/container/blob/develop/include/boost/container/e...
Basically, prefetching the first slot with a value in the array for operator++, or the last one for operator--, instead of always prefetching the start of the array. I only see a slight improvement (or maybe it's a placebo effect).
Regarding prefetching and storing the data in the block or separately, I did my own experiments and settled on what's been proposed, though of course some more experimenting can't hurt. Separate data in particular was a noticeable win for some element sizes and environments, which I attribute (without proof) to block headers and data blocks ending up in different per-size pools in the allocator, with a resulting improvement in cache locality for both.
--------------
Local/segmented iterators
--------------
I tried to implement efficient segment
https://github.com/boostorg/container/blob/develop/include/boost/container/e...
and local_iterators
https://github.com/boostorg/container/blob/develop/include/boost/container/e...
in order to test whether hub can take advantage of generic segmented algorithms. The problem I found with my implementation is that the local iterator is nearly as expensive as the general iterator, because it must navigate through the data array using nearly the same bit-tricks.
At this microlevel, it's hard to beat a local mask-based loop like the one boost::container::hub uses. I noticed that you check for empty segments here: https://github.com/boostorg/container/blob/develop/include/boost/container/e... ("if (BOOST_LIKELY(m != 0))"). That check is not needed: boost::container::{hub|nest} only traverses non-empty segments.
[...]
--------------
Summary
--------------

- hub's defaults seem correct and performant.
- I found hub is faster than plf::hive, as Joaquín claimed.
- It's possible to fully integrate hub at the same level as other containers, with moderate effort. A prototype (nest) is already there with similar performance.
- I enjoyed reviewing the implementation; more details about my design review in the official review I plan to submit.
Looking forward to your review,

Joaquín M López Muñoz
On 21/04/2026 at 18:25, Joaquin M López Muñoz via Boost wrote:
At this microlevel, it's hard to beat a local mask-based loop as boost::container::hub uses.
Very hard. I was going crazy trying to figure out why nest was behind in the for_each bench, without luck. It turns out that changing

   pbb->mask & (full << 1 << std::size_t(n));

to (note: no std::size_t conversion on n, which is an int)

   pbb->mask & (full << 1 << n);

was a 5% improvement, because converting the int to uint64 before shifting had a noticeable impact. I was quite shocked.

I think the port's performance is on par with the proposal on my machine, in case you want to test it:

https://github.com/boostorg/container/blob/develop/experimental/bench_hub.cp...

Ion
participants (2):

- Ion Gaztañaga
- Joaquin M López Muñoz