AI in software development: data scraping

5 min read · Feb 26, 2024

AI is a trend that nearly every startup wants to build into its product. It can speed up delivery and enable features that would otherwise be complicated to develop. But there are still legal issues, especially around intellectual property (IP) rights, that concern investors. S&P Global has reported a slowdown in the mergers and acquisitions deal market and suggests that the primary reasons are connected to issues with IP rights. So how can this struggle be avoided? The simple answer is to ensure your IP rights are protected and demonstrable.

In our previous article, we discussed the protection of rights when using AI tools during software development. This time, we’ll look at AI development itself, focusing on the need for large amounts of data for AI training. We’ll cover both the EU and US perspectives.

Training your AI model

Data is essential. When building an AI product, you’ll soon realize how much data you need to train your AI model. The question then becomes: how do you get the necessary data? A common answer is scraping it from the web. Web scraping, a form of text and data mining (TDM), uses software to collect and extract data from web pages automatically. But if you plan to mine text and data, or to use that data to train your AI model, will it be free of intellectual property issues? Not necessarily.

Restrictions on text and data mining (TDM) in the European Union

When web scraping in the European Union, you need to be aware of the rules arising from the Directive on Copyright and Related Rights in the Digital Single Market (Directive (EU) 2019/790). This law lets rights holders, including website owners, reserve their rights and so stop you from scraping and using their data. If you don’t follow these rules, you could face legal issues and harm your business. To avoid problems, make sure your scraping tools can recognize and respect website owners’ restrictions, for example by checking for machine-readable ‘no scraping’ signals such as robots.txt files and meta tags.

Specifically, your software should at a minimum be able to:

Check for robots.txt directives: Implement functionality to read and interpret robots.txt files on websites. These files often contain rules about what parts of the site can or cannot be scraped, including specific guidelines for TDM activities. Your tool should respect these directives by avoiding restricted areas.
Identify and adhere to meta tags or headers: Enhance your software to detect meta tags or HTTP headers that website owners use to explicitly ban or limit TDM. This means your tool needs to recognize these signals and change its scraping actions to avoid violating the website owner’s rules.
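
The two checks above could be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not a compliance tool. The function names, the `example-bot` user agent and the `OPT_OUT_TOKENS` set (including the non-standard `noai` value that some sites use) are assumptions made for this example. Which signals actually count as a valid reservation of rights is a legal question that code alone cannot answer.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumption: tokens this sketch treats as a TDM opt-out. "noai" and
# "noimageai" are non-standard conventions, not part of any official spec.
OPT_OUT_TOKENS = {"noai", "noimageai"}


def split_tokens(value: str) -> set[str]:
    """Split a comma-separated directive string into lowercase tokens."""
    return {t.strip().lower() for t in value.split(",") if t.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives |= split_tokens(attr.get("content") or "")


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def opted_out(html: str, headers: dict[str, str]) -> bool:
    """Return True if a robots meta tag or X-Robots-Tag header carries an opt-out token."""
    meta = RobotsMetaParser()
    meta.feed(html)
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    return bool((meta.directives | split_tokens(header_value)) & OPT_OUT_TOKENS)


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    page = '<html><head><meta name="robots" content="noindex, noai"></head></html>'
    print(allowed_by_robots(robots, "example-bot", "https://example.com/private/data"))
    print(opted_out(page, {}))
```

In a real scraper, both checks would run before any page content is stored or passed on for training, and a page that fails either check would be skipped. The sketch takes the robots.txt text, HTML and headers as plain arguments so that the logic can be tested without network access.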

Restricting text and data mining is just one hurdle in the European Union. There are many rules, including EU directives and national laws, that can complicate the issues with intellectual property rights. Even if you comply with TDM restrictions, you might still breach the intellectual property rights of third parties.

Training an AI model in the United States

Text and data mining in the US also comes with its challenges. Recently, the Ross case (Thomson Reuters v. Ross Intelligence) has brought new attention to the use of copyrighted materials to train machine learning models. The case is expected to provide guidance on the interpretation of fair use, particularly in the context of the development of artificial intelligence.

The core question is whether using copyrighted materials to train AI models is fair use or copyright infringement. If a developer’s use falls under fair use, they can use copyrighted materials without getting permission from, or paying compensation to, the copyright owner. But if the final work is considered a derivative work, a new work has been created by modifying or adapting existing copyrighted material. In that scenario, the AI developer would need permission from the original copyright holder to lawfully distribute, display or perform the new work.

Although the Ross case is ongoing, it raises the question of when AI training creates a derivative work and when it qualifies as fair use. A derivative work would mimic the creative expression of the original, while fair use might involve using the material for a new purpose, such as analyzing language patterns rather than copying the creative design. How these issues are resolved will have far-reaching implications for AI development and the legal frameworks governing intellectual property.

The EU and the US perspectives compared

In the US, fair use permits the use of copyrighted material without a license based on a four-factor test. The factors are the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the potential market for or value of the original work.

The EU, on the other hand, has no equivalent to US fair use. It relies instead on narrower copyright exceptions for specific uses such as teaching and research, which offer less flexibility.

Making derivative works in the US usually requires the copyright holder’s permission, unless it’s considered fair use. The EU also requires permission for derivative works, but what counts as a derivative work can vary by country due to different national laws.

Practical recommendations

To strengthen your company’s investment position despite the challenges around fair use, derivative works and text and data mining, it’s essential to:

Develop an intellectual property strategy that covers all aspects of AI development and ensures innovation and data protection.
Prioritize legal compliance by conducting thorough due diligence on data sources and ensuring compliance with copyright regulations to avoid infringement risks.
Proactively mitigate potential legal issues by identifying and addressing risks associated with fair use, derivative works, and TDM.
Maintain transparency with investors and potential partners by providing clear information on your intellectual property practices and data use policies.
Seek advice from legal experts specializing in IP law and AI regulations to navigate the legal landscape and ensure compliance with best practices.
Continuously monitor legal developments and industry trends related to fair use, derivative works and TDM and adjust strategies accordingly to mitigate emerging risks and maintain compliance.
