Discussion in 'Site Feedback' started by Gameasy, Mar 28, 2024.

  1. Hvast

    Hvast Really Really Experienced

    Your remark isn't even related to my argument.

    AI is not the subject of the law. People who make AI are. And taking text/images from the internet and using them in training is just a new type of fair use.

    Remember, it is legal to "profit from others' work" when you sell your own work. Selling your review, parody, or study material is still a legal fair use. And the author or another copyright holder cannot forbid such use as long as it is transformative enough.
    TheLowKing likes this.
  2. TheLowKing

    TheLowKing Really Really Experienced

    As an author, I do have the right to choose who does or doesn't get to use my work and in what way. That's what copyright is. You're right, there are limits to my control over my work: parodies and reviews are fair use. That's not what AI is used for, though. AI creates derivative works of all types, and no AI has ever categorically refused to finish a prompt because it might violate copyright (and no AI company has ever tried). Just because you call it fair use doesn't mean it is. Citation needed. :p

    Also, AI training (specifically, training of LLMs, which is the current hype) is not "the same" as human learning. They are very, very different processes. Thankfully! If they weren't, kids would use terawatt hours of electricity, emit more CO2 than a fleet of BMWs, and require continuous and instant access to terabytes of text. Each!
    Last edited: Mar 31, 2024
  3. Hvast

    Hvast Really Really Experienced

    Copyright is, by the very name, about the right to COPY. It is about the prohibition of making copies in one way or another.

    No, you don't have any right to decide what other people do with your product as long as they get it in a legal way. And it is legal to take it from an open-access source you placed it in.

    But it has nothing to do with the training of AI and selling that model to the public. What's wrong here is producing content that is too similar to copyrighted material. It is the result of (mis)use of the AI model.
    Disney has the right to sue my ass if I use an AI model to draw Mickey Mouse and sell it. It has no right to stop me from putting the Mouse in training data. (They will TRY to get such a right, but it is so against the spirit of copyright law that if this is allowed, it will be horrible.)

    The difference is irrelevant. As I said, AI is not even a subject there. Neither are students (unless self-teaching). The teacher/AI maker is the subject whose actions get evaluated against copyright law.


    PS. Fun fact: if I go to some language model and ask it "write me a story in which a Twi'lek and a Hobbit fuck," I (not the one who made or provided the model) am breaking copyright. Both of those races are copyrighted, and me creating a single copy for my own consumption is still an infringement. Good luck enforcing that, of course.
    Last edited: Mar 31, 2024
  4. xmare

    xmare Virgin

    This is a huge improvement, thank you very much.

    As for AI, putting academic stuff aside, I'd expect that to be an opt-in/out thing. I would happily consider opting in because I don't make any money from my work, so the idea of people easily creating more of it sounds like a good thing to me.
  5. gene.sis

    gene.sis CHYOA Guru

    I'm not an expert on LLMs, so feel free to correct me!

    I don't think fair use matters in any way when it comes to LLM-generated content.

    LLMs don't have libraries with all the data they have been trained on. They just "read" the content, figure out the relationship between the single tokens, and work that relationship into the model.
    If your work was part of an LLM's training data and a line from your work appears in content created by that LLM, the LLM didn't copy that from your work. It is more likely that the line followed a very common pattern and used the same tokens "by accident" (where "by accident" really means that the preceding context within the LLM-created content, and possibly the prompt, made that combination more likely to occur). Or the line was part of many sets of training data, so it is likely a common phrase. Another option might be that it has been copied in the first place.

    Probably not. But I'm not sure whether you could prompt an LLM to deliberately and reliably create copyright-violating content.
    For an LLM to know whether the content it created might violate copyright, it might need to compare the created content with the data it has been trained on.

    If you put content on a site that allows other users to read the content, it is legal for other users to read the content.
    That doesn't mean that it is legal to crawl the site.
    So personally, I wouldn't say it is legal by default.

    The issue might be that you can't prove that your content has been used, unless the LLM trainer keeps records of the content they used for training.

    As far as I know, we understand about as much about the human brain as we do about how LLMs learn.
    So the process might not be that different, even though the current hardware most likely is. (Well, the scope of human learning is way broader and adds context from more areas. Humans might also keep a limited library as a resource.)

    A three-dimensional computing structure with built-on-the-fly hard-wired combinational logic might be more efficient than doing countless mathematical operations on a GPU.
    So if LLMs ran on different hardware, they might be way more energy efficient.

    From a practical point of view, I don't think it really matters on what exact content the LLM is trained as long as there are many datasets that aren't too similar.
    It might be different if you train an LLM only on a certain type of content. E.g. if you train it on Tweets only, the output might appear to be like a Tweet.
  6. Friedman

    Friedman Administrator

    I was not asked about AI training and therefore did not give permission to use CHYOA's content. Given that all the major LLMs censor or do not allow pornography, there doesn't seem to be much focus on content like ours, but I could be wrong.

    I played around with AI generators (Dreampress to be precise) for erotic stories online, but wasn't particularly impressed. In my view, a human author continues to beat AI when it comes to (erotic) stories. Planning a story and steering it in a certain direction requires vision and creativity.
    Toby Mark likes this.
  7. grimbous

    grimbous Really Experienced

    Sounds like you haven't yet, but only because you weren't asked, and that you would give permission if you were. Am I reading that correctly?
  8. Friedman

    Friedman Administrator

    I haven't thought about it until now. The stories are the property of their respective authors, so I don't see myself in a position to grant permission.
    Toby Mark, TheLowKing and grimbous like this.
  9. grimbous

    grimbous Really Experienced

    Awesome. Thank you Friedman. :)
    Friedman likes this.
  10. TheLowKing

    TheLowKing Really Really Experienced

    I stopped reading here. I try very hard to avoid being rude to people in online discussions, because the world is shitty enough as it is, but I'm not sure how I can convey just how out of your depth you are without speaking some blunt truths. Everything you said is wrong. If this is your understanding of what copyright is, then you should read up on it, because all you're doing now is confidently stating falsehoods, and it's painful to read. As always, Wikipedia is a great place to start.

    Agreed. Training an LLM to only respect copyright or only infringe on it is essentially the same process with the sign flipped. You don't need to do either of those things for it to infringe on copyright, though: if your process generates 20 non-infringing and 5 infringing outputs then you're running afoul of the law regardless of whether that process is an AI or an office full of desk clerks.

    I'm less concerned with whether or not my individual work can be word-for-word reproduced (though I have seen instances of this happening) than I am with my and others' work being used by for-profit companies without my consent, whether through AI or otherwise.

    These are some of the largest corporations in the world (Google, Microsoft, Facebook, etc.), trawling the Internet for whatever work they can get their hands on, without regard for licensing, without asking permission, without offering any kind of remuneration. They then run all that work through a meat grinder and offer people access to it in exchange for vast sums of money. I think that's wrong.

    In fact, my impression is that human learning is even less well-understood than LLM training. When I brought up things like power and data use, my goal was to point out that we can be pretty confident that they are different processes even if we don't understand them at all, just by looking at the inputs and outputs: if 3 + <some complicated formula> = 9 and 1 + <a different complicated formula> = 13, then we don't need to work out the complicated formulae to know that they are different.

    If a different process of training LLMs was invented that required orders of magnitude less power and data, then that might be closer to the way humans learn. Or it might not.
  11. Hvast

    Hvast Really Really Experienced

    Wikipedia you say...

    A copyright is a type of intellectual property that gives the creator of an original work, or another right holder, the exclusive and legally secured right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time

    Exactly what I said in the part you quoted. Distribute, adapt, display and perform are merely other forms of copying. If you invoke Wikipedia (not a reliable source), check that it supports what you say.
  12. Hvast

    Hvast Really Really Experienced

    Yes, it is extremely hard to find out whether a specific copyrighted text was used in training. But it is extremely easy to check whether copyrighted characters (or other copyrighted concepts) are in the training data. If you ask a language model "Who is Luke Skywalker?" and it consistently gives you anything close to a correct answer, he was in the training data.

    If the people who want to destroy the spirit of copyright win and make it mandatory for model trainers to get permission from copyright holders, some corporations will become far richer, because good luck avoiding Luke Skywalker in your training data... I think you could just run a script and delete all mentions of him, but then the usefulness of the model and the quality of the training data would go down significantly.
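
    For what it's worth, here is a minimal sketch of what such a "delete all mentions" script might look like in Python. The function name drop_mentions and the three-line corpus are made up for illustration, and this is not how any actual training pipeline is known to work; it only shows the crude idea of matching the exact name.

```python
import re


def drop_mentions(documents, name):
    """Return only the documents that never mention `name` (case-insensitive).

    Hypothetical helper for illustration: it matches the exact name only.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    return [doc for doc in documents if not pattern.search(doc)]


# Made-up toy corpus standing in for training data.
corpus = [
    "Luke Skywalker destroys the Death Star.",
    "A review: the farm boy from Tatooine grows into a hero.",
    "A recipe for sourdough bread.",
]

filtered = drop_mentions(corpus, "Luke Skywalker")
print(len(corpus), "->", len(filtered))
```

    Note that this throws away every whole document containing the name, while the second text, which describes the character without naming him, passes straight through.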
  13. gene.sis

    gene.sis CHYOA Guru

    You still can't prove that your (or even any) copyrighted data about the character was used as training data.
    There might be movie reviews that were part of the training data or opinions from forum users. (Even though that might still be copyrighted content that got used with or without permission.)

    On the other hand, an LLM might get linked to a search engine, so it can get actual information, analyze it, and offer you a conclusion.
    That might lead to similar issues and might adversely affect the concept of search engines.

    The main thing might still be that a real person would basically do the same as an LLM. They would only go a step further and follow a search result to the source page where they pay by ad consumption or other means. (And that's likely one of the main reasons why sites even allow search engines to crawl their sites.)
    Hvast and TheLowKing like this.
  14. TheLowKing

    TheLowKing Really Really Experienced

    But it would sure be suspicious if 17 names that I made up myself for specifically my fantasy story, names that I specifically made sure no one else ever used, suddenly showed up in the output of an LLM. And I have seen exactly this happen more than once: the LLM would generate a cool-sounding name, I would look it up and get exactly 1 result: some story on Wattpad or a similar place. I don't know if that would hold up in a court of law, but personally that would convince me that my work was used to train the LLM. (OK, not with 17 names. But I have seen 2 names from the same story.)
  15. Hvast

    Hvast Really Really Experienced

    Irrelevant. The character itself is the copyrighted object here, not the work containing it.

    Yes, people discussing Luke Skywalker on a forum do it in a fair-use way, but that doesn't mean that using those discussions in the training data is fine if we conclude that using anything in the training data is a form of copying that doesn't fall under fair use.

    If I extract someone's property from fair-use sources (like reviews or study material) and start using it in non-fair-use applications, I am breaking copyright just as much as if I took it directly from the copyright holder.
    raziel83 likes this.