Reflections on accuracy in AI-assisted reviews

Thanks to everyone who’s taken the time to comment on the Boost.SQLite draft. The feedback has been sharp and helpful, and it also points to a broader challenge: accuracy in AI-assisted contributions.

Sooner or later, someone will say that “hallucinations” make this kind of work impossible. But if we look at domains like law or medicine, the bar is already clear: no degree of inaccuracy is acceptable. That’s true whether the source is human or machine. In both cases, what matters is the process we put around the work: fact-checking, red-teaming, and review. Believe it or not, and despite the “low effort AI slop” characterization, quite a lot of HITL (human-in-the-loop) review took place before the call to publish was made. Not enough on this occasion, we can agree. I should have phrased this better in the draft.

My aim isn’t to excuse errors, but to emphasize that a useful workflow combines research with adversarial review. Trusted agents won’t emerge from a single prompt; they’ll be shaped over time, with heavy human-in-the-loop checks at the start and automated cross-checks as things mature. Reuters, for example, developed a legal research assistant this way: a year of iterative review by lawyers until its answers were consistently trustworthy. I’ll continue refining AI research generation with this in mind: less “AI produced,” more “AI assisted, human reviewed, evidence logged.” The goal is to get closer to the accuracy standard that applies to all contributions here, whether typed by a person or assisted by a tool.

Hi Sergio,

Thank you for your exploration. I do believe AI is useful, at least for me when I'm too tired to write proper English (or too irritated to have lean communication). I have not understood what your objective is in this thread. Would you mind rephrasing what you are trying to achieve?

Kind regards, Arno

Thanks again for engaging. To clarify my objectives in this thread:

1. Explore AI’s role in research: use AI to help compile evidence on Boost libraries and their competitive landscape, always paired with human review and red-teaming.
2. Generate value for library authors: insights might help maintainers as they consider roadmaps, adoption strategies, or positioning relative to other work.
3. Prototype a repeatable workflow: a cycle we can refine over time (a rough sketch follows below the quoted message):
- Evidence collection (gather sources, benchmarks, docs)
- HITL curation (filter, annotate, prioritize)
- Compilation & drafting (AI produces a first-pass narrative from curated sources)
- Draft red-teaming (adversarial review to catch errors, weak claims, omissions)
- Further HITL curation (reorganize, elevate, cut sections, guide re-drafts)
- Finalization (executive summary last, human sign-off before publish)
4. Connect to Alliance / research-labs: this is the umbrella where we’re running AI-first projects, with custom code as required for integration and experience. The JSON libraries pilot is just the first; the aim is to learn what kinds of briefs and comparisons are most useful to the community.
5. Surface ongoing signals: beyond one-off reports, there’s an opportunity to categorize discussion-board posts and repo PRs not only as news but also as metadata enrichment for other insight needs. Both streams could be valuable to Boost Libraries and could feed into promotional vehicles, keeping them evergreen and analytically richer.

An afterthought: perhaps starting with a clearer introduction would have been better. It’s not my first rodeo with skepticism or resistance to AI-assisted tasks. But I’m learning from each effort, and it keeps this 35+ year solution architect young, and stretching just beyond the comfortable.

On Tue, Sep 2, 2025 at 12:23 PM Arnaud Becheler <arnaud.becheler@gmail.com> wrote:
Hi Sergio,
Thank you for your exploration. I do believe AI is useful, at least for me when I'm too tired to write proper English (or too irritated to have lean communication). I have not understood what your objective is in this thread. Would you mind rephrasing what you are trying to achieve?
Kind regards, Arno
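
Picking up item 3 from above: here is a minimal sketch of that cycle, in Python. Everything in it is an assumption rather than actual tooling; the function names, the draft_with_ai call, and the interactive sign-off gate only mark where the model and the humans would sit in the loop.

    # Minimal, illustrative pipeline skeleton; every name here is a placeholder.
    from dataclasses import dataclass

    @dataclass
    class Evidence:
        source_url: str
        note: str
        keep: bool = True                      # toggled during HITL curation

    def collect_evidence(topic):
        # Placeholder: gather sources, benchmarks, and docs for the topic.
        return [Evidence("https://example.org/doc", f"notes on {topic}")]

    def curate(evidence):
        # HITL curation: a human filters, annotates, and prioritizes.
        return [e for e in evidence if e.keep]

    def draft_with_ai(evidence):
        # Placeholder for the model call that writes a first-pass narrative.
        return "DRAFT: " + "; ".join(e.note for e in evidence)

    def red_team(draft):
        # Adversarial review: return flagged claims; an empty list means none.
        return []

    def human_signoff(draft):
        # Final gate: nothing is published without an explicit yes.
        return input("Publish this draft? (y/n)\n" + draft + "\n> ").lower() == "y"

    def run(topic):
        evidence = curate(collect_evidence(topic))
        draft = draft_with_ai(evidence)
        while red_team(draft):                 # repeat while the red team flags anything
            evidence = curate(evidence)        # another HITL pass
            draft = draft_with_ai(evidence)    # guided re-draft
        return draft if human_signoff(draft) else None

The only point of the skeleton is that the model sits between two human gates; any real implementation would replace every function body.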

I believe Arnaud was asking for clarification, not further obfuscation. Also, top-posting is considered poor etiquette on the mailing list.

[This space intentionally left blank] On Tue, Sep 2, 2025 at 1:11 PM Janko Dedic <jankodedic2@gmail.com> wrote:
I believe Arnaud was asking for clarification, not further obfuscation. Also, top-posting is considered poor etiquette on the mailing list.
Well, not everyone will find my response clarifying. I was just admonished for top-posting (not knowing what that was is perhaps no excuse for breaking the rule, but it is at least relatable), so I will adopt the best practice quickly. Vinnie gave me a couple of tightly written points that nicely restate some of what I meant, describing the current project: “two possible areas of ongoing benefit: a per-review-announcement explanation of the problem space (e.g. the need for sqlite wrappers), and a per-review-announcement enumeration and description of competing libraries”.

I see. I got myself into top-posting trouble before too; it’s another counter-intuitive aspect of the mailing list. Fortunately Vinnie took the time to send me some screenshots on Slack to make sure I got it.

I like your idea of using AI to present the problem space of a new library and a list of competing libraries. I agree it’s one thing where AI can help. My take is that AI should alleviate our work and let us focus on the essential. So if AI can generate all of these heavy, inexact, and lengthy details, our work becomes to summarize them, verify them and make sure they are digestible and scannable by the community. To that end I still believe a bunch of templates could help, to enforce a format and a structure (paragraphs, length, word count, content, etc.) which we could i) populate with AI and ii) refine with human common sense and creativity to make the result more digestible, more trustworthy and more community-oriented (AI humour sucks). So I guess it’s a balance to find. I don’t think directly feeding AI output to human communities is very efficient (presently, things change fast).

Kind regards, Arno
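
To sketch what I mean by a template, here is a rough example in Python; the section names and word limits are invented for illustration, not a proposal for the actual format:

    # Illustrative only: a review-brief template with hard limits that an
    # AI-populated draft must respect before a human starts editing it.
    TEMPLATE = {
        "problem_space":       {"max_words": 150},
        "competing_libraries": {"max_words": 250},
        "open_questions":      {"max_words": 100},
    }

    def check_draft(draft):
        """Return a list of template violations found in a draft dict."""
        problems = []
        for section, rules in TEMPLATE.items():
            text = draft.get(section, "")
            if not text:
                problems.append(f"missing section: {section}")
            elif len(text.split()) > rules["max_words"]:
                problems.append(f"{section} exceeds {rules['max_words']} words")
        return problems

The human pass then works inside those bounds, which is what keeps the result scannable.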

I like your idea of using AI to present the problem space of a new library and a list of competing libraries. I agree it’s one thing where AI can help.
Is this valuable during the review? I would argue no. The giant AI-generated post read like a relatively standard competitive analysis. That should be looked at much earlier by the author in the development lifecycle of the library. The expectation of reviewers is that they are not newcomers to the area of review, and as we have seen they often reference the behavior of other libraries. I think this kind of analysis is more useful in the documentation of the library. In this specific case Klemens has already provided this to an extent, and the same is seen across other libraries (e.g. Boost.JSON, Boost.Unordered, etc.). Matt

Is this valuable during the review? I would argue no. The giant AI-generated post read like a relatively standard competitive analysis.
I agree, this should be either in the documentation, or in the RM’s review-announcement email (in a much shorter form). I do not think such an AI product should be part of an individual review, for the simple reason that if it’s easily generated, then each reviewer using AI is at risk of obfuscating the discussion with redundant information :) So maybe there is a hint here toward discouraging/banning the use of AI for reviewers. Kind regards, Arno

There was a saying I once heard about AI which is that:
AI is for creative people to do tedious things, not for tedious people to do creative things.
In this case, the Boost review is the creative piece and there's not much tedium in writing one. In fact, one should actually _enjoy_ writing a Boost review. It should be an act of care and passion. Maybe the tedium is in proofreading it or something like that. But generating a review really defeats the purpose, which is why it didn't land. - Christian

On Wed, Sep 3, 2025 at 7:40 AM Christian Mazakas via Boost <boost@lists.boost.org> wrote:
...generating a review really defeats the purpose, which is why it didn't land.
Yes, and I think I should have been more clear: AI will NEVER write a meaningful review which includes an acceptance outcome, for the simple reason that the attention strategy of LLMs is incapable of exercising the creative discretion which has been, and still is, an exclusive feature of human cognition. However these systems are still creating value in many industries, and our goal with this research is to figure out how it might create value for Boost.

It is clear that the generative output provokes strong emotions, for predictable reasons. No one wants to be replaced. I think this is an irrational fear, for Boost at least; I don't see any current or near-future technologies capable of doing the type of work that we do. Given the sharply negative reaction I think we will put this project on hold and continue exploring how AI might assist us in other ways.

For example, we are building an agent to analyze the entire archive of mailing list posts and propose topical keywords. And also to categorize each post, in particular to identify conversations tied to formal reviews so we can index and display such posts on the corresponding Library page. Hopefully this will be non-controversial, and there will be no visible generated AI content that might annoy people.

If anyone is interested in participating in this and other projects please reach out; this is an exciting area of research and I hope we can get some concrete results which will be positive for Boost. Thanks
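
To give a rough idea of the shape of that agent, here is a sketch in Python; the mbox path, the category names, and the classify_post call are placeholders standing in for whatever model and data source we actually end up using:

    # Sketch of the archive-tagging idea. classify_post stands in for the
    # model call; categories and paths are placeholders, not a design.
    import mailbox

    CATEGORIES = ["formal-review", "announcement", "question", "discussion"]

    def classify_post(subject, body):
        # Placeholder for the LLM call: returns (category, topical keywords).
        return "discussion", []

    def tag_archive(mbox_path):
        tagged = []
        for msg in mailbox.mbox(mbox_path):
            subject = msg.get("Subject", "")
            body = msg.get_payload()
            if not isinstance(body, str):      # skip multipart messages in this sketch
                continue
            category, keywords = classify_post(subject, body)
            tagged.append({"subject": subject,
                           "category": category,
                           "keywords": keywords})
        # Posts tagged "formal-review" could then be indexed and displayed
        # on the corresponding Library page.
        return tagged

None of this is committed; it is only meant to show that the output is metadata about posts, not generated content shown to readers.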

On Wednesday, September 3, 2025 at 08:01 -0700, Vinnie Falco via Boost wrote:
Given the sharply negative reaction I think we will put this project on hold and continue exploring how AI might assist us in other ways.
My feeling on that is that the experiment went wrong for the following reasons:
* the analysis was sent in the middle of the review, making it look like a review whereas it is not one
* the tone of the analysis is very verbose, so AI-looking
* the analysis itself was not reviewed enough, and contained some false statements, which may not be obvious to the casual user. Reviewers are not necessarily sqlite experts.
In my opinion, having such an overview of the current field is valuable, but it needs:
* to be done before the review, so that each reviewer can refer to it
* to be done, or at least corrected / reviewed, by the author of the library.
Analyzing competitors and comparing the library has, as far as I remember, always been a request in reviews. AI can help in doing that step. Or maybe not. It's just a tool.

The good question about the experiment is not whether AI-generated content is relevant. It's how valuable Sergio's message was for reviewers, and for the review manager. In my opinion, there's some value in what has been posted, but the message is too verbose, the timing was wrong, and the presence of errors embarrassing. It would probably have been better if it had been done before the review, phrased in a more concise way, and corrected by the library author.
For example, we are building an agent to analyze the entire archive of mailing list posts and propose topical keywords. And also to categorize each post, in particular to identify conversations tied to formal reviews so we can index and display such posts on the corresponding Library page.
This looks like a nice project. Thanks for experimenting in these areas as well. Regards, Julien

On 03.09.25 at 09:55, Matt Borland via Boost wrote:
I like your idea of using AI to present the problem space of a new library and a list of competing libraries. I agree it’s one thing where AI can help. Is this valuable during the review? I would argue no. The giant AI-generated post read like a relatively standard competitive analysis. That should be looked at much earlier by the author in the development lifecycle of the library. The expectation of reviewers is that they are not newcomers to the area of review, and as we have seen they often reference the behavior of other libraries.

I think such an "analysis" could be valuable at the start of the review. There were some corrections on the facts in that post. To my understanding: one looks like a nitpick, one like additional information, one was a real inaccuracy that got fixed though it was not completely wrong, and I don't agree with the JSON point.
So it doesn't seem too bad, although it might have been better if the library author had had a look at the summary too. I think having such an overview/summary might make it easier for reviewers to focus on specific parts / get ideas about what to look at. So I'm not fully opposed to such a practice. Might be worth trying again and refining the process before fully abandoning it.

On Thu, Sep 4, 2025 at 11:37 AM Alexander Grund via Boost <boost@lists.boost.org> wrote:
I think having such an overview/summary might make it easier for reviewers to focus on specific parts / get ideas about what to look at.
Publishing was a misstep. Entirely mine. Timing. Unrealistic scope. And not enough red-teaming. We should have slowed down. Moving too fast and breaking things is not of value to anyone. I share the above vision: the vision to support reviewers, not to “replace” anyone, even in appearance. How to do that better will be the subject of this iteration’s retrospective: earlier prep, tighter human editing, clearer separation from the review process. Insights and critiques are welcome.
participants (8):
- Alexander Grund
- Arnaud Becheler
- Christian Mazakas
- Janko Dedic
- Julien Blanc
- Matt Borland
- Sergio DuBois
- Vinnie Falco