@suy

suy@programming.dev · edited 23 days ago

Yeah, you are not gonna be able to do that with an LLM. They will be able to quote only some passages, and only of popular books that have been quoted often enough.

You entirely ignored this part.

You basically proved my point in doing so, BTW. You cannot do what you claimed with an LLM. And I’m not saying, and never said before, “ChatGPT” or “OpenAI”. I don’t understand why you think that I might be “defending these hypocritical companies”, when I literally said the opposite at the end.

You are entirely fooled by the output of ChatGPT and you are not arguing in good faith (or you are entirely unable to understand what I said).

Edit/addendum: And to stress my point, given that the person I replied to showed the output of ChatGPT as if it were any kind of proof, this is what other LLMs say. This is 4o mini:

Large Language Models (LLMs) like me do not have the ability to quote whole sections of copyrighted texts verbatim. While I can generate text based on patterns and information learned during training, I do not store or recall specific texts or books. Instead, I can provide summaries, analyses, or discuss themes and concepts related to a book without directly quoting it. If you have a specific topic or question in mind, feel free to ask!

And this is Llama 3.1 70B:

Large Language Models (LLMs) can generate text based on the patterns and structures they’ve learned from their training data, which may include books. However, whether they can quote whole sections of a book depends on several factors.

LLMs are typically trained on vast amounts of text data, including books, articles, and other sources. During training, they learn to recognize patterns, relationships, and context within the text. This allows them to generate text that is similar in style and structure to the training data.

However, LLMs do not have the ability to memorize or store entire books or sections of text. Instead, they use the patterns and relationships they’ve learned to generate text on the fly.

That being said, it’s possible for an LLM to generate text that is very similar to a section of a book, especially if the book is well-known or widely available. This can happen in a few ways:

Overlapping patterns: If the book’s writing style, structure, or content is similar to other texts in the training data, the LLM may be able to generate text that resembles a section of the book.

Memorization of key phrases: LLMs may memorize key phrases, quotes, or passages from the training data, which can be recalled and used in generated text.

Contextual generation: If the LLM is given a prompt or context that is similar to a section of the book, it may be able to generate text that is similar in content and style.

However, it’s unlikely that an LLM can quote a whole section of a book verbatim, especially if the section is long or contains complex or unique content. The generated text may be similar, but it will likely contain errors, omissions, or variations that distinguish it from the original text.

Feel free to give them a shot in: https://duck.ai

suy@programming.dev · 2 months ago

Wow, thanks, I had not seen this comment, yet I hinted at this in some of my earlier replies.

Yes, I think ML is fair use, but then it would also be fair to force something into the public domain/open source if, in order to be created, it has to rely on fair use at an unprecedented scale.

This would be a difficult law to write, though. Current ML is very inefficient in the amount of data it requires, but it could (and should) be made better.

suy@programming.dev · 2 months ago

Now I sail the high seas myself, but I don’t think Paramount Studios would buy anyone’s defence they were only pirating their movies so they can learn the general content so they can produce their own knockoff.

We don’t know exactly how they source their data (and that is definitely shady), but if I can gain access to a movie in a legal way, I don’t see why I would not be able to gather statistics from said movie, including running a speech-to-text model to caption it, then computing statistics of how many times each word was used, and which words followed it. This is an oversimplified explanation of what an LLM does, but it’s the fairest I can come up with, and it would be legal to do so. The models are always orders of magnitude smaller than the data they are trained on.

That said, I’m not implying that I’m happy with the state of high tech companies, the AI hype, the energy consumption, or the impact on the humble people. But I’ve put a lot of thought into this (and into learning about machine learning for real), and I think this is not an ML problem, but a problem in the economic, legal and political system. AI hype is just a symptom.
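As a minimal sketch of the "which word follows which" statistics described above: the caption text is a made-up stand-in for the output of a speech-to-text model, and `bigram_counts` is a hypothetical helper. A real LLM learns neural network weights over tokens rather than keeping a table of counts, so this only illustrates the oversimplified version.

```python
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        follows[current][following] += 1
    return follows


# Made-up caption text standing in for a speech-to-text transcript.
captions = "the ship sails and the ship sinks and the crew swims"
counts = bigram_counts(captions)

# Words seen after "the", most frequent first.
print(counts["the"].most_common())
```

The resulting table only records how often each pair occurred; the original sentence order beyond adjacent pairs is not kept.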

suy@programming.dev · 2 months ago

It’s not AI

It’s not AGI, it’s not general intelligence, and it’s not comparable to a human (well, you can compare anything, but human and ML are just very different things in tons of ways).

But it is AI. The ghosts that chase Pacman are AI. A search algorithm is also AI, dammit. Of course an LLM is AI. Any agent that maximizes a function is AI. You are just embarrassing yourself.

suy@programming.dev · 2 months ago

But then it does go on to quote materials verbatim, which shows it’s not “just” ‘extracting patterns’.

It is just extracting patterns. It is making statistical samples of which token (“word”, informally speaking) is likely to follow, given the previous stream.

It can only reproduce passages of things it has seen many, many times. It cannot reproduce the whole work. Those two quotes can be seen elsewhere on the internet plenty of times. And it’s fair use there, so it would be fair use with a chat bot as well.
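A toy illustration of that sampling step, assuming the statistics are plain counts (the numbers in `after_the` are invented, and `sample_next` is a hypothetical helper; a real model computes probabilities with a neural network conditioned on the whole previous stream):

```python
import random
from collections import Counter


def sample_next(counts, rng):
    """Pick the next token with probability proportional to its count."""
    tokens = list(counts)
    weights = [counts[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented counts of what followed the token "the" in some corpus.
after_the = Counter({"ship": 8, "crew": 1, "sea": 1})

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = Counter(sample_next(after_the, rng) for _ in range(1000))
print(samples.most_common())
```

A continuation that appeared many times dominates the samples, while one seen once or twice rarely comes out.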

There have been papers published where researchers were able to regenerate an image that was present in the training set of Stable Diffusion. But they were only able to find that image (and others) in particular, because they were present in the training set multiple times, and the caption was the same (it was the portrait picture of some executive at a company).

when given the book and pages — quote copyrighted works

Yeah, you are not gonna be able to do that with an LLM. They will be able to quote only some passages, and only of popular books that have been quoted often enough.

Even if they started to use my service to literally copy entire books?

You cannot do that with an LLM.

Why are you defending massive corporations who could just pay up? Isn’t the whole “corporations putting profits over anything” thing a bit… seen already?

I hate that some corporations are burning money, resources and energy on this, and the solution is not to restrict fair use even further. Machine Learning is complex, but if I had to summarize it in some way, it is “just” gathering statistics of which word comes next (in the case of a text model). This is no different than getting a large corpus of text and sampling it for word frequency, letter frequency, N-gram frequency, etc. It is well known that this is fair use. You only store the copyrighted works to run the software and produce a very transformative work that is a summary many orders of magnitude smaller than the copyrighted work. This is fair use, and it should stay that way. Changing that is gonna harm the public, small companies and independent researchers way more than big tech companies.

As I said in another comment, I would very much welcome a way to force big corpos to release their models. Make a model bigger than N parameters? You needed too much fair use in one gulp: your model has to be public, and in the public domain. I would fucking welcome that! But going in the opposite direction is just risky.

I don’t understand why small individuals think that copyright is their friend, and will protect them from big tech companies. Copyright will always harm the weak and protect the powerful as a net result. It’s already a miracle that we can enjoy free software and culture by licenses that leverage copyright in our favor.

suy@programming.dev · 2 months ago

“Theft” is never a technically accurate word when dealing with so-called “intellectual property”, because copying digital content without authorization is legal in tons of cases, and because, come on, property is very explicitly exclusive. I cannot copy my house or my car, but I can make copies of my works for virtually 0 cost.

Using data for training ML models is even explicitly allowed in some jurisdictions (e.g. Japan), and is likely to be fair use everywhere else. LLMs are very transformative, and while they often can produce verbatim copies of fragments of copyrighted works, they don’t store the whole works or significant pieces of them.

Don’t get me wrong, I don’t like big companies making big money. I would not mind a law that would force models to be open sourced. But stopping them from training their models on public data by restricting fair use would harm them very little (they could pay something if they are making some profit), while small researchers or companies would never be able to compete, because they could not afford the upfront costs, nor would they have the economic engineering to disguise profits and pay less.

suy@programming.dev · 6 months ago

Excuse me, what?

suy@programming.dev · 6 months ago

I’ve been compiling apps depending on newer Qt and/or kdelibs versions for ages (back when the repository was literally called “kdelibs”, about 20 years ago).

This has never been an issue for me. Even with autoconf/automake, I just compile everything to its own prefix, so it doesn’t interfere with the system at all. You don’t even need to fix the build system in the cases where it’s broken/lacks features, if you leverage all the “path” variables (CPATH, LIBRARY_PATH, LD_LIBRARY_PATH, PKG_CONFIG_PATH, etc.). But autotools, cmake, qmake, and every build system I’ve used so far support this out of the box.
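A rough sketch of the “path” variables approach, written as a small Python helper that builds the environment for a private prefix. The prefix `/opt/qt-custom` and the function name are made up for illustration, and the `lib` directory layout is an assumption (some distributions use `lib64` or a multiarch directory instead).

```python
import os


def prefix_environment(prefix, base_env=None):
    """Return a copy of the environment with a private prefix prepended."""
    env = dict(os.environ if base_env is None else base_env)
    additions = {
        "CPATH": f"{prefix}/include",                # extra header search path
        "LIBRARY_PATH": f"{prefix}/lib",             # link-time library search path
        "LD_LIBRARY_PATH": f"{prefix}/lib",          # run-time library search path
        "PKG_CONFIG_PATH": f"{prefix}/lib/pkgconfig",  # where pkg-config finds .pc files
    }
    for name, path in additions.items():
        old = env.get(name)
        # Prepend so the private prefix wins over system locations.
        env[name] = path if not old else f"{path}:{old}"
    return env


# Hypothetical prefix where a newer Qt was installed.
env = prefix_environment("/opt/qt-custom", base_env={"CPATH": "/usr/local/include"})
print(env["CPATH"])
```

The returned dictionary could then be passed as `env=` to `subprocess.run` when invoking the build, so nothing outside that process is affected.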

Not claiming it’s a skill issue, but I have to say I’m very surprised by reading any of this.

Specifically, for Debian, I was told 20 years ago by a very wise person “you never do make install on Debian, especially not for the kernel”, and that person taught me how to use make-kpkg (or something like that, I don’t remember the name of the tool), which was a way to make a Debian package of a self-built kernel, which is obviously something that can’t be installed to its own prefix.

suy@programming.dev · 7 months ago

Related: There is an article on LWN called Lua and Python, which is mostly about the approach of the two languages WRT being “batteries included” or not.

I think Lua being a bit barebones is 100% fine… if you just pair it with a good helper library, or set of libraries with a coherent API, that allows it to thrive. Then you can either use the framework library or not, depending on whether your project requires the extras, or can do without.

As a parallel, I’ve been doing C++ development for almost two decades, and I cannot imagine doing anything non-trivial without Qt. For example, Qt has a debug framework that automatically pretty-prints most containers, and also adds the newline automatically. Also, QString is an actual string type, whereas std::string is more like QByteArray. It’s functionality that is essential for me (and these are just the minimal examples… then Qt has all the GUI functionality, of course, but I use Qt even in console-only programs!).

This is surely opinionated on my side, and most C++ devs don’t see it this way, but my point is that a language with a “core experience” that is lackluster to you should not be a bad thing if the language is capable enough to provide an ecosystem with a good 3rd party library that adds exactly what you want. In the Lua ecosystem, maybe that is Penlight.

But I totally get your point. Penlight doesn’t even seem to have a math library, so I found no round implementation there. This may not be a problem for some, but a deal breaker for others.

suy@programming.dev · 7 months ago

I’d have to dig it up, but I think it said that it added the PID and the uninitialized memory to add a bit more data to the entropy pool in a cheap way. I honestly don’t get how that additional data can be helpful. To me it’s the very opposite. The PID and the undefined memory are not as good quality as good randomness. So, even without Debian’s intervention, it was a bad idea. The undefined memory triggered valgrind, and after Debian’s patch, if it weren’t for the PID, all keys would have been reduced to 0 randomness, which would have probably raised the alarm much sooner.
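To illustrate why the PID alone is such poor entropy, here is a toy sketch. It is not how OpenSSL derives keys: `toy_key` is a hypothetical stand-in that hashes nothing but the PID, and the bound assumes the default Linux `pid_max` of 32768.

```python
import hashlib

MAX_PID = 32768  # default pid_max on Linux; PIDs run from 1 to 32767


def toy_key(pid):
    """Toy stand-in for key generation whose only 'entropy' is the PID."""
    return hashlib.sha256(pid.to_bytes(4, "big")).hexdigest()


def recover_pid(key):
    """Brute-force every possible PID until the key matches."""
    for pid in range(1, MAX_PID):
        if toy_key(pid) == key:
            return pid
    return None


victim_key = toy_key(12345)  # arbitrary example PID
print(recover_pid(victim_key))
```

With the PID as the only varying input there are at most 32767 distinct keys, so trying all of them takes a fraction of a second.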

suy@programming.dev · 7 months ago

no more patching fuzzers to allow that one program to compile. Fix the program

Agreed.

Remember Debian’s OpenSSL fiasco? The one that affected all the other derivatives as well, including Ubuntu.

It all started because OpenSSL added a bunch of uninitialized memory and the PID to the entropy pool. Who the hell relies on uninitialized memory ever? The Debian maintainer wanted to fix Valgrind errors, and submitted a patch. It wasn’t properly reviewed, nor accepted in OpenSSL. The maintainer added it to the Debian package’s patches, and then everything after that is history.

Everyone blamed Debian “because it only happened there”, and definitely mistakes were made on that side, but I surely blame the OpenSSL developers much more.

suy@programming.dev · 7 months ago

Is it, really? If the whole point of the library is dealing with binary files, how are you even going to have automated tests of the library?

The scary thing is that there are people still using autotools, or any other hyper-complicated build system in which this is easy to hide, because who the hell cares about learning Makefiles, autoconf, automake, M4 and shell scripting all at once to compile a few C files. I think hiding this in any other build system would have been definitely harder. Check this mess:

  dnl Define somedir_c_make.
  [$1]_c_make=`printf '%s\n' "$[$1]_c" | sed -e "$gl_sed_escape_for_make_1" -e "$gl_sed_escape_for_make_2" | tr -d "$gl_tr_cr"`
  dnl Use the substituted somedir variable, when possible, so that the user
  dnl may adjust somedir a posteriori when there are no special characters.
  if test "$[$1]_c_make" = '\"'"${gl_final_[$1]}"'\"'; then
    [$1]_c_make='\"$([$1])\"'
  fi
  if test "x$gl_am_configmake" != "x"; then
    gl_[$1]_config='sed \"r\n\" $gl_am_configmake | eval $gl_path_map | $gl_[$1]_prefix -d 2>/dev/null'
  else
    gl_[$1]_config=''
  fi

suy@programming.dev · 8 months ago

I’ve wanted to start a project in Rust, but for the ideas that I have (and the time that I have for a hobby project, since at work it’s rarely starting a new project, but continuing an existing one), Rust seemed a viable but not ideal alternative to just doing it all in C++, for which I already have enough knowledge and very well proven libraries. I will look again soon, and I will keep looking because eventually something will surely click; it’s just that so far, the time has not been right.

Note that my point is not that it’s unusable for everyone. Just that it’s false that “some people just can’t seem to let [C or C++] go”, as the previous comment said. I can’t let go something that works well for something that doesn’t, given the projects that I have to work on.

suy@programming.dev · 8 months ago

It’s just time to move on from C/C++, but some people just can’t seem to let go.

The Rust community has 2 websites that I keep periodically checking: Are we game yet? and Are we GUI yet?. The answers on those sites are respectively (as of February 2024, when this comment is written) “Almost. We have the blocks, bring your own glue” and “The roots aren’t deep but the seeds are planted”. I’ve seen the progress in Bevy and Slint, but it’s still the same: those websites don’t change, and my situation WRT making a Rust project for fun or work is the same.

I’ll be happy to start doing Rust projects whenever I get the chance (which will be when it’s a sufficient tool for my use cases). But I’m tired of smoke sellers.

suy@programming.dev · 9 months ago

The very first moment that I had to use JSON as a configuration format, I was desperate to find a way to fit a long string into a JSON field. JSON is great for many things, but it’s not good at all as a configuration format where you need users to make it pretty, and need features like comments or multi-line strings (because you don’t want to fix a merge conflict in a 400 character-wide line).
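A quick demonstration of those two gaps, using Python’s standard `json` module as the strict parser. The configuration keys (`retries`, `motd`) and the `try_parse` helper are invented for the example.

```python
import json


def try_parse(text):
    """Parse JSON, or report that the parser rejected it."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as error:
        return f"rejected: {error.msg}"


# A comment inside the document: not part of the JSON grammar.
with_comment = '{\n  // retry count\n  "retries": 3\n}'

# A string value broken across two physical lines: also not allowed.
multi_line = '{"motd": "first line\nsecond line"}'

# The only portable option: one long line with an escaped newline.
escaped = '{"motd": "first line\\nsecond line"}'

for sample in (with_comment, multi_line, escaped):
    print(try_parse(sample))
```

Only the last form parses, which is exactly the kind of line that ends up 400 characters wide in a real configuration file.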

suy@programming.dev · 9 months ago

Doesn’t YAML have a (seldom used) feature of a start and end of document marker? The “YAML frontmatter” that a few markdown documents have, uses this.
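For reference, YAML uses `---` as the document start marker and `...` as the document end marker. A minimal sketch of how front matter can be split off using those markers, without a real YAML parser: `split_front_matter` is a hypothetical helper, and accepting either `---` or `...` as the closing line is an assumption about what the tool in question allows.

```python
def split_front_matter(document):
    """Split a document into (front matter, body); front matter is None if absent."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, document
    for index in range(1, len(lines)):
        # Assumption: the block is closed by either "---" or "...".
        if lines[index].strip() in ("---", "..."):
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return header, body
    return None, document  # opening marker without a closing one


# Made-up Markdown post with a YAML header.
post = "---\ntitle: Hello\ntags: [lua, yaml]\n---\n# Heading\n\nBody text.\n"
header, body = split_front_matter(post)
print(header)
print(body)
```

The header text would then be handed to an actual YAML parser, and the body to the Markdown renderer.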

suy@programming.dev · 1 year ago

Radon, the “handwriting” one, seems as if someone wanted to have Comic Sans but for code.

suy@programming.dev · 1 year ago

I heard the rumor that Linux desktop environments use it too. Now hopefully multimedia apps with 3 letters like VLC and OBS can adopt it too.

j/k