Almost every website and service is being scraped at an alarming rate. Are Lemmy servers facing this issue too?

Please share mitigations you’ve seen applied to this.

  • safesyrup@feddit.org · 30 upvotes · 4 months ago

    I think Lemmy content is scraped too, just like the rest of the web is being scraped. I do not have any proof of it, though.

    I have seen a user add an anti-commercial-AI license as a footer to every comment he writes lol

    • SSUPII@sopuli.xyz · 16 upvotes · 4 months ago (edited)

      Those are truly useless against bad actors and are instead only annoying for the humans who read them. And good actors with proper licenses won’t be scraping Lemmy, Reddit or Twitter.

      You just cannot prevent it on Lemmy, because if one instance places filters like Anubis, another will not. And it is not feasible to require every instance to do so. Also, this is an open platform by nature, and there is no group or company that can mandate rules of access. When you limit non-humans, you might also be limiting real users with peculiar configurations or behind heavy privacy middleware.

      • Captain Beyond@linkage.ds8.zone · 4 upvotes · 4 months ago

        The point (as I see it) is not so much to stop scraping as it is to prevent bots from effectively DDoS-ing web services. As others have said, ActivityPub content is public and there are ways to get it without slamming instances with scraper bots.

    • potatoguy@potato-guy.space · 15 upvotes · 4 months ago

      It is. I saw ClaudeBot and GPTBot scraping my instance and made a post about it on fuckai, but I have blocked all these bots now and my instance is a lot faster.

      • Forester@pawb.social · 9 upvotes · 4 months ago

        Out of curiosity: I am not at all familiar with the stack that runs behind the scenes for Lemmy. Are you blocking IP ranges or something else?
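
The thread does not say how the bots were blocked. One common approach, shown here only as a minimal sketch and not as what potatoguy actually did, is to match the request's User-Agent header against a list of crawler tokens. The token list and both function names below are assumptions made for illustration; in practice this check would usually live in the reverse proxy in front of the instance.

```python
from typing import Optional

# Hypothetical blocklist: lowercase substrings of crawler User-Agent strings.
BLOCKED_AGENT_TOKENS = ("claudebot", "gptbot")


def is_blocked_agent(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent contains one of the blocked tokens."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def status_for_request(user_agent: Optional[str]) -> int:
    """Return the HTTP status a front end might send: 403 for blocked bots, else 200."""
    return 403 if is_blocked_agent(user_agent) else 200
```

User-Agent strings are self-reported, so this only stops crawlers that identify themselves honestly; a bot that spoofs a browser string would need IP-based or behavioural filtering instead.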

    • axby@lemmy.ca · 2 upvotes · 4 months ago

      I don’t host a Lemmy instance, but I post links in my comments. I sometimes generate unique-ish URLs to share updates on specific versions of my hobby projects. I’ve seen them requested a few times in my Apache logs by user agents claiming to be from OpenAI, Anthropic, etc., and also by search engine crawler bots.

      Here’s the IP whose user agent claimed to be an Anthropic bot; it seems like others have encountered the same behaviour: https://abuseipdb.com/check/216.73.216.135
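
A check like the one axby describes can be scripted. The sketch below assumes the logs use Apache's combined log format and counts hits per (bot token, IP) pair. The token list, the function name and the sample lines are made up for illustration (the IPs come from reserved documentation ranges), so none of it reflects axby's actual logs.

```python
import re
from collections import Counter

# Apache "combined" log format: ip, identd, user, [time], "request",
# status, size, "referer", "user-agent".
COMBINED_LOG = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical list of crawler tokens to look for in the User-Agent field.
BOT_TOKENS = ("GPTBot", "ClaudeBot")


def count_bot_hits(lines):
    """Count requests per (bot token, client IP); unparseable lines are skipped."""
    hits = Counter()
    for line in lines:
        match = COMBINED_LOG.match(line.strip())
        if not match:
            continue
        agent = match.group("agent").lower()
        for token in BOT_TOKENS:
            if token.lower() in agent:
                hits[(token, match.group("ip"))] += 1
    return hits


# Made-up sample lines, not real traffic.
SAMPLE_LINES = [
    '203.0.113.7 - - [01/Jan/2025:12:00:00 +0000] "GET /v/abc123 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [01/Jan/2025:12:00:05 +0000] "GET /v/abc123 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.4 - - [01/Jan/2025:12:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for (token, ip), count in sorted(count_bot_hits(SAMPLE_LINES).items()):
        print(token, ip, count)
```

Because the user agent is only a claim, cross-checking the IP against an abuse database, as done above, is a sensible second step.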

  • ramble81@lemmy.zip · 20 upvotes, 1 downvote · 4 months ago (edited)

    They don’t really need to scrape. They just have to set up their own federated instance and the ActivityPub protocol will willingly hand it all to them in a nicely parsable format.
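
To illustrate what "nicely parsable" means here: ActivityPub objects are JSON, typically served when a client sends an `Accept: application/activity+json` header. The sketch below only builds such a request and summarizes a sample object; it does not contact any server. The URL, the sample object and both function names are hypothetical, and some instances may additionally require signed requests.

```python
import json
import urllib.request

ACTIVITY_JSON = "application/activity+json"


def build_activitypub_request(object_url: str) -> urllib.request.Request:
    """Build (but do not send) a request asking for the ActivityPub JSON form."""
    return urllib.request.Request(object_url, headers={"Accept": ACTIVITY_JSON})


def summarize_object(raw_json: str) -> dict:
    """Pull a few common ActivityStreams fields out of a JSON object."""
    obj = json.loads(raw_json)
    return {
        "type": obj.get("type"),
        "author": obj.get("attributedTo"),
        "content": obj.get("content"),
    }


# Hypothetical, simplified object for illustration; real responses carry more fields.
SAMPLE_OBJECT = json.dumps({
    "@context": "https://www.w3.org/ns/activitystreams",
    "id": "https://lemmy.example/comment/1",
    "type": "Note",
    "attributedTo": "https://lemmy.example/u/alice",
    "content": "<p>Hello, fediverse</p>",
})

if __name__ == "__main__":
    request = build_activitypub_request("https://lemmy.example/comment/1")
    print(request.get_header("Accept"))
    print(summarize_object(SAMPLE_OBJECT))
```

No HTML parsing is needed: author, type and content arrive as named fields, which is the point ramble81 is making.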

  • CaptainBasculin@lemmy.ml · 4 upvotes · 4 months ago

    It’s very easy for any ActivityPub content to be scraped; all servers practically serve the content on a silver platter to any federated server.

  • Lemuria@lemmy.ml · 2 upvotes · 4 months ago

    I’m sure even the AI devs who are so lazy that they cannot train their AI on anything other than scraped HTML can set up a Lemmy instance and point their crawlers at that.