On Tue, Feb 17, 2026 at 11:13 AM Peter Dimov via Boost <boost@lists.boost.org> wrote:
...
Disclaimer: Claude Opus 4.6 wrote this:

How Transformers Work (Brief)

A transformer is a neural network architecture that learns statistical
relationships between tokens (subword units). During training, it adjusts
billions of floating-point weights via gradient descent to minimize
prediction error on the next token. The result is a compressed, lossy
representation of patterns in the training data -- not a database of
documents.

Key points:

- Attention mechanism. Self-attention lets each token attend to every
  other token in the context window, computing weighted relevance scores.
  This is how the model captures long-range dependencies -- syntax,
  argument structure, style -- without storing literal sequences.

- Weights ≠ storage. A model with ~100B parameters trained on trillions
  of tokens cannot store those tokens verbatim. The information is
  destructively compressed. It's more analogous to how a human programmer
  who has read a lot of Asio code might unconsciously reproduce idioms
  and patterns than to a photocopier.

- Memorization does happen, but it's the exception. Research (Carlini et
  al., "Extracting Training Data from Large Language Models") has shown
  that LLMs can regurgitate verbatim snippets, particularly of data that
  appeared many times in training or is highly distinctive. Short, unique
  sequences (API keys, specific code blocks) are more susceptible. But
  for typical code, the output is a probabilistic reconstruction, not
  recall.

What This Means for the Copyright Question

Dimov's analysis is roughly correct. The real risks are:

1. Verbatim reproduction -- possible but unlikely for non-trivial code
   blocks. The longer the sequence, the less likely it's memorized
   exactly. Modern models also apply deduplication and guardrails to
   reduce this.

2. Structural copying -- a model might reproduce the architecture or
   design pattern of a copyrighted work without copying literal text.
   This is harder to adjudicate.
Copyright protects expression, not ideas, so reproducing an API design
or algorithmic approach is generally not infringement.

The BSL angle: Dimov's point that Asio (and Cobalt) are BSL-licensed is
pragmatically relevant. Even if a model did memorize and reproduce
fragments, the source material's permissive license weakens any
infringement claim substantially, since the copyright holder has already
granted broad usage rights.

Rivera Morell's concern about license incompatibility is the more subtle
issue: if a model blends patterns from BSL and GPL sources into a single
output, what license applies? This is genuinely unresolved law.

Bottom line: Transformers don't "remember" documents the way a database
does. They learn compressed statistical patterns. Verbatim reproduction
is possible but empirically rare for code, and the risk is further
mitigated when training data is permissively licensed. The harder open
question is about structural similarity and license mixing, which no
court has definitively addressed yet.

Thanks
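P.S. The "weights ≠ storage" argument can be made concrete with a quick
back-of-envelope calculation. All figures below are illustrative
assumptions (a ~100B-parameter model, a ~10T-token corpus), not
measurements of any particular model:

```python
# Back-of-envelope: model capacity vs. training-set size.
# Every number here is an assumed, illustrative figure.

params = 100e9            # ~100B parameters (assumed)
bytes_per_param = 2       # e.g. 16-bit weights
model_bytes = params * bytes_per_param          # 200 GB of weights

tokens = 10e12            # ~10T training tokens (assumed)
bytes_per_token = 4       # rough average bytes of text per subword token
corpus_bytes = tokens * bytes_per_token         # 40 TB of training text

print(f"model:  {model_bytes / 1e9:.0f} GB")    # model:  200 GB
print(f"corpus: {corpus_bytes / 1e12:.0f} TB")  # corpus: 40 TB
print(f"ratio:  {corpus_bytes / model_bytes:.0f}x")  # ratio:  200x

# Even if every weight were spent purely on storage, the model could
# hold only a small fraction of the corpus verbatim; the rest must be
# compressed away into statistical patterns.
```

Under these assumptions the corpus is roughly 200x larger than the
weights, which is why verbatim retention has to be the exception rather
than the rule.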
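P.P.S. The attention mechanism described above reduces to a few lines of
linear algebra. A minimal single-head sketch using numpy, with arbitrary
toy dimensions and random weights (no real model's parameters are
involved):

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token embeddings. Each output row is a
    relevance-weighted mix of every token's value vector -- the model
    stores mixing weights, not literal sequences.
    """
    q, k, v = x @ wq, x @ wk, x @ wv            # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])     # pairwise relevance, scaled
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                          # weighted sum of values

seq_len, d_model, d_head = 5, 8, 4              # toy sizes, chosen arbitrarily
x = rng.standard_normal((seq_len, d_model))
wq, wk, wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))

out = self_attention(x, wq, wk, wv)
print(out.shape)                                # (5, 4): one vector per token
```

In a real transformer this runs with many heads in parallel and learned
projections, but the core operation -- every token attending to every
other token via softmaxed dot products -- is exactly this.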