On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company's e-commerce site was down. It looked like some kind of distributed denial-of-service attack.
He soon discovered that the culprit was a bot from OpenAI that was relentlessly trying to scrape his entire massive site.
“We have over 65,000 products, and each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.”
OpenAI was sending “tens of thousands” of server requests trying to download all of it: hundreds of thousands of photos, along with their detailed descriptions.
“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, so it may be even more,” he said of the IP addresses the bot used in its attempt to consume his site.
“Their crawlers are attacking our site,” he said. “It's basically a DDoS attack.”
Triplegangers' website is his business. For more than a decade, the seven-employee company has amassed the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.
It sells 3D object files, as well as photos of everything from hands to hair, skin, and full bodies, to 3D artists, video game developers, and anyone who needs to digitally recreate authentic human features.
Tomchuk's team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone does nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI's bot, GPTBot, to leave the site alone. (OpenAI also has two other bots, ChatGPT-User and OAI-SearchBot, which have their own tags, according to its information page on its crawlers.)
Robots.txt, also known as the Robots Exclusion Protocol, was created to tell search engines what not to crawl as they index the web. OpenAI says on its information page that it honors such files when they are configured with its own set of do-not-crawl tags, though it warns that it can take up to 24 hours for its bots to detect an updated robots.txt file.
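For site owners who want to send that signal, the file itself is short. Here is a minimal robots.txt sketch that turns away all three OpenAI crawlers using the user-agent names from OpenAI's crawler documentation; the blanket Disallow rule is illustrative, and a site could scope it to specific paths instead.

```
# robots.txt, a minimal sketch: turn away OpenAI's three crawlers.
# The user-agent names come from OpenAI's crawler documentation;
# "Disallow: /" covers the whole site and could be narrowed.
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```

Even with a file like this in place, the 24-hour detection lag OpenAI describes means a crawl that is already underway can keep going for a while.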
As Tomchuk experienced, if a site isn't using robots.txt properly, OpenAI and others take that to mean they can scrape to their hearts' content. It's not an opt-in system.
To add insult to injury, not only was Triplegangers knocked offline by OpenAI's bot during U.S. business hours, but Tomchuk also expects a jacked-up AWS bill thanks to all of the CPU and download activity from the bot.
Robots.txt also isn't a failsafe; AI companies comply with it voluntarily. Another AI startup, Perplexity, rather famously got called out last summer by a Wired investigation when some evidence suggested Perplexity wasn't honoring it.
By Wednesday, days after OpenAI's bot returned, Triplegangers had a properly configured robots.txt file in place, as well as a Cloudflare account set up to block GPTBot and several other bots he discovered, such as Barkrowler (an SEO crawler) and Bytespider (TikTok's scraper). Tomchuk also hopes he has blocked crawlers from other AI model companies. By Thursday morning, the site hadn't crashed, he said.
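Cloudflare's blocking is configured through its dashboard, but conceptually it boils down to matching on the User-Agent header. A WAF custom rule expression along these lines, paired with a Block action, is one plausible way to implement what Tomchuk describes; his exact rule isn't public, so treat this as an assumption-laden sketch, and note that a declared bot name is trivial for a bad actor to spoof.

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Barkrowler")
```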
But Tomchuk still has no reasonable way to find out exactly what OpenAI managed to take, or to get that material removed. He has found no way to contact OpenAI and ask; OpenAI did not respond to TechCrunch's request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.
This is a particularly thorny problem for Triplegangers. “We're in a business where rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe's GDPR, “they cannot just take a photo of anyone on the web and use it.”
Triplegangers' website is also an especially delicious find for AI crawlers. Multibillion-dollar startups like Scale AI have been built on humans painstakingly tagging images to train AI, and Triplegangers' site has photos tagged in detail: ethnicity, age, tattoos vs. scars, all body types, and so on.
The irony is that the OpenAI bot's greediness is what alerted Triplegangers to how exposed it was. Had it scraped more gently, Tomchuk would never have known.
“It's scary because there seems to be a loophole that these companies are using to crawl data by saying ‘if you update your robots.txt with our tags, you can opt out,’” Tomchuk said, but that puts the onus on the business owner to understand how to block them.

He wants other small online businesses to know that the only way to discover whether an AI bot is taking a website's copyrighted material is to actively look. He is certainly not alone in being terrorized by them; owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.
The problem grew by magnitudes in 2024. New research from the digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024, that is, traffic that doesn't come from a real user.
Still, “Most sites don't even know they've been scraped by these bots,” warns Tomchuk. “Now we have to monitor log activity every day to detect these bots.”
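As a concrete illustration of that kind of daily log check, here is a short Python sketch that tallies requests from known crawler user agents in a standard web server access log. The log path and the signature list are assumptions for illustration, not details from the article.

```python
# Tally requests from known crawler user agents in an access log.
# The log path and bot names below are illustrative assumptions.
from collections import Counter

BOT_SIGNATURES = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",  # OpenAI's crawlers
    "Bytespider", "Barkrowler",                 # others named in this story
]

def count_bot_hits(log_path: str) -> Counter:
    """Count log lines whose user-agent field mentions a known bot."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            for bot in BOT_SIGNATURES:
                if bot in line:
                    hits[bot] += 1
    return hits

if __name__ == "__main__":
    for bot, count in count_bot_hits("/var/log/nginx/access.log").most_common():
        print(f"{bot}: {count} requests")
```

A cron job running a script like this once a day would surface the kind of activity Tomchuk says he now has to watch for manually.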
When you think about it, the whole model works a bit like a mafia shakedown: the AI bots will take what they want unless you have protection.
“They should be asking permission, not just scraping data,” Tomchuk said.