• NuXCOM_90Percent@lemmy.zip

    They don’t “see” characters in their input; they see words that get tokenized into their own internal vocabulary, which is why any question along the lines of “How many Ms are in Lemmy?” is challenging even for advanced, fine-tuned models.
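
    A rough illustration of what the model actually receives, using the tiktoken package (the cl100k_base vocabulary is an assumption; the exact split varies by model):

        import tiktoken  # assumes the tiktoken package is installed

        enc = tiktoken.get_encoding("cl100k_base")  # one common GPT-style vocabulary
        ids = enc.encode("How many Ms are in Lemmy?")
        pieces = [enc.decode([i]) for i in ids]
        print(pieces)  # e.g. ['How', ' many', ' Ms', ' are', ' in', ' L', 'emmy', '?']
        # The model sees the integer ids, not letters, so it never directly
        # observes how many M characters are inside ' Ms' or 'emmy'.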

    And that is solved just by keeping a non-processed version of the query (or one passed through a different grammar that preserves character counts and typos). It is not a priority because there are no meaningful queries where that matters beyond a “gotcha”, but you can be sure it will be bolted on if it ever becomes a problem; a sketch of that kind of bolt-on is below.
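
    A minimal sketch of such a bolt-on, assuming a hypothetical pre-tokenization shim that answers counting questions from the raw string before anything is tokenized (the function name and regex are illustrative, not any real product’s):

        import re
        from typing import Optional

        def answer_letter_count(raw_query: str) -> Optional[str]:
            # Hypothetical shim: match "how many X are/is in Y" against the
            # raw, untokenized query and count characters directly.
            m = re.search(r"how many (\w)s? (?:are|is) in (\w+)", raw_query, re.IGNORECASE)
            if not m:
                return None  # not a counting gotcha; hand the query to the model
            letter, word = m.group(1), m.group(2)
            count = word.lower().count(letter.lower())
            return f"There are {count} '{letter}'s in '{word}'."

        print(answer_letter_count("How many Ms are in Lemmy?"))  # There are 2 'M's in 'Lemmy'.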

    Again, anything this trivial is just a case of a poor training set or an easily bolted-on “fix” for something that had no commercial value outside of getting past simple filters.

    Sort of like how we saw captchas go from “type the third letter in the word ‘poop’” to nigh-unreadable color-blindness tests to free computer-vision labeling for “self-driving” cars.

    They can also be tripped up if you simulate a repetition loop.

    If you make someone answer multiple questions just to shitpost, they are going to go elsewhere. People are terrified of Lemmy because there are different instances, for crying out loud.

    You are also giving people WAY more credit than they deserve.