Web scraping and the future of AI

by Matt Ober, General Partner at Social Leverage

A court ruled in favor of Bright Data over Meta. For now, we have clarity on what can and can’t be scraped from the web.

Given how much web scraping is happening these days and how important it is for AI, this ruling matters. One thing it made clear is that web scraping is permissible when you don’t have to log in to a site. For sites where you do have to log in and the terms of service prohibit web scraping, I believe you are in clear violation of those terms. All that being said, web scraping is happening everywhere.

When I worked in the hedge fund industry, we had many ‘rules of the road’ for our internal teams who were web scraping themselves or partnering with an outsourced web scraping company. A few rules were pretty clear:

  • You can’t web scrape to avoid paying for data. If the company or website you are scraping offers a paid API, then you must engage and pay the company. Pretty fair and straightforward.

  • If the company asks in its robots.txt file that you not scrape a specific page of its website, you must obey. I think it is fair to follow the website’s request. For those unfamiliar, add /robots.txt after any website’s domain to see what it does and doesn’t want web crawlers to scrape. For example: https://www.ibm.com/robots.txt

  • If a company requires you to log in or has a click-through agreement, then you cannot web scrape. This is pretty similar to the recent court ruling.
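
For teams that want to automate the robots.txt rule above, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the example.com URLs, the "my-bot" user agent, and the can_scrape helper are all made up for illustration; this is one way to do the check, not a description of how any particular fund or vendor does it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def can_scrape(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs used only to exercise the rules above.
    for url in (
        "https://example.com/pricing",
        "https://example.com/private/report.html",
    ):
        verdict = "allowed" if can_scrape(SAMPLE_ROBOTS_TXT, "my-bot", url) else "disallowed"
        print(f"{url}: {verdict}")
```

Against a live site, the same parser can load the file itself with set_url() followed by read(). Keep in mind this only covers the robots.txt rule; it says nothing about paid APIs, logins, or terms of service, which still need a human to check.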

With the value of data going up every day and the number of data-hungry AI models being trained, web scraping rules are going to come up more and more. This is the first of many lawsuits and the first of many discussions of robots.txt, and I think we will see a lot of laws come into question in the coming years. At the end of the day, I think the clearest rule to follow is this: if the company or website sells its data, you shouldn’t scrape to get around compensating them.