Andrey Semashev wrote:
Yes, possibly. In this specific case, however, most (if not all) of Claude's knowledge comes from Asio, and it's BSL.
(Klemens claims that Claude has learned about coroutines from Cobalt. Fortunately, Cobalt is also BSL, so we're safe.)
How do we know that? Do e.g. OpenAI or Anthropic publish the data sets used to train their models? Do they make official statements about the source data sets, and are their claims verifiable? Would Boost become liable if such claims, were they made, turned out to be false? Would it become a problem if future models are trained on new data sets that include not only BSL-licensed code but also code under other, incompatible licenses?
As an example, there was the case of Meta AI, which was (allegedly?) trained on pirated books. I'm not sure whether this was ever proven in court, though; I wasn't following it.
You could argue about running a local instance of an LLM trained on a data set you carefully prepared, but my understanding is that the majority of users are using third-party LLMs trained on who knows what.
The only "safe" option I see is a clear rule that whatever licenses covered the source data set, those licenses never transfer onto the output generated by the model. That rule could be guaranteed either by copyright law or by the LLM provider (i.e. in case of copyright infringement, the LLM provider is bound to take full responsibility). But I do not think such a rule exists, which is why I would like a lawyer's opinion on this.
We can't know any of that, of course. But past experience indicates that worrying about such gray-area issues (as far as authoring and distributing open source libraries is concerned), or taking some sort of preventative action in anticipation of hypothetical future problems, is a waste of time and resources. If corporations that have a lot more to lose are incorporating AI-generated code into their codebases, we can reasonably conclude that we can do it too.