Date: 2024-10-29

Photo courtesy of Zydrunas Tamasauskas

Opinions expressed by Digital Journal contributors are their own.



Zydrunas has spent over 20 years in the IT industry, working in various fields of software development. As the Chief Technology Officer at Oxylabs, a leading web intelligence acquisition platform, Zydrunas manages a large team of engineers and shoulders the responsibility for developing Oxylabs’ products and finding new technological directions.

Recently, the media has been filled with stories of big AI companies getting in trouble for scraping web data. Have you noticed an increased demand from AI companies?

Yes, definitely. Today, many more AI firms are coming to us looking for web scraping services than a year ago. Their data needs are getting bigger and more nuanced since most AI systems still rely on stochastic ML models that are only as good as the data they are trained on. However, collecting extensive data for model training in-house is complicated — it requires personnel with specific skills and costly data acquisition infrastructure, such as reliable proxies.

Increasing interest from AI companies is both an opportunity and a challenge for web scraping providers. The EU AI Act, which has just entered into force, lays down various levels of requirements for AI firms, from explicitly prohibiting specific AI systems to requiring transparency in data collection and management practices. Indirectly, it also brings certain responsibilities for EU-based web scraping providers: we must ensure we do not serve AI firms that do not comply with the regulation.

As you have noticed, many AI firms have been under fire recently — even if they are training their models on publicly available web data, concerns regarding copyrighted data ownership and data privacy still arise. Around 150 zettabytes of data is generated annually, and it’s a common good, like air or water. Nobody thought a day would come when somebody would actually try to scrape the entire internet and use this data. So, AI brought us to an unprecedented situation. It is a considerable challenge since there are no clear answers (neither from a legal nor moral point of view) on how public web data collection for AI training needs should be regulated.

What part does AI play in your products?

Over the last few years, Oxylabs has been using more and more AI-driven features to automate mundane web scraping processes. We started using ML models back in 2021. Today, these features range from ML-driven proxy management and response recognition to the newly released OxyCopilot, an AI tool that allows users to generate API request payloads and parsing instructions by entering simple natural language prompts.

Parsing is a perfect example of how AI can be used for automation. Parsing is a time-consuming task often performed by junior developers since it doesn’t require an extensive skill set. Nevertheless, it is costly — a US and UK developers’ survey, carried out by Oxylabs and Censuswide in August 2024, showed that a majority of developers (75%) working on web scraping tasks spend from 11 to 40 hours per week building and maintaining data parsers. Parsing pipelines can break as often as several times per week or even every day — it happens due to dynamic website layouts and, in some cases, chaotic representation of data on the same site.

It took our team of ML engineers only three months to develop an AI Copilot that can assist developers in building and fixing parsers. The feature recognizes instructions in simple language, identifies complex parsing patterns even when provided with multiple URLs, and provides parsing instructions within minutes. Since it doesn’t require calling LLMs for each request, it is easily scalable. At first, we intended to use OxyCopilot to help our own developers save time and resources wasted on the parsing process; however, we soon noticed there was wider demand in the market for AI-driven parsing automation.

Is OxyCopilot suitable for people with little coding experience, for example, scientists or investigative journalists who need public web data to do research?

Yes, that was one of the main goals when we started developing it. Of course, one needs to have at least a basic understanding of web scraping and its processes; nevertheless, OxyCopilot is a big step towards democratizing web data collection by lowering the barrier to entry for smaller companies and bridging the skills gap between experienced developers and the low-code community.

Creating new parsing instructions with this tool does not require writing any code. Running requests to the Web Scraper API, on the other hand, will require some programming, but our GitHub documentation covers it all step-by-step.

As part of a larger platform, the Web Scraper API, OxyCopilot solves at least three of the biggest parsing-related challenges developers tend to mention. The first is identifying complex parsing patterns, the second is extracting nested or listed information, and the third is adapting to constantly changing website structures. So, although you still need some basic scraping knowledge to get data with OxyCopilot’s assistance, you won’t face as many complex challenges as you would without it.

Do you believe we’ll get to the point where AI will fully automate web scraping?

I believe that fully automating the entire data collection process would require something called artificial general intelligence, and today’s technology is still far from that. Web scraping requires extensive knowledge of coding and data science to determine which data is relevant, bypass anti-scraping measures, debug the code if necessary, and make optimal cloud storage decisions.

Moreover, collecting web data at scale requires navigating different legal regulations, from the site’s Terms of Service to data privacy laws. Some of these matters might be purely a question of interpretation. In short, web scraping involves different moving parts that require the ability to think outside the box. If you are collecting web data on a massive scale, keeping a human in the loop remains necessary.

In your business, do you witness a negative side of AI?

As with any technology, AI can be exploited for various purposes. Take anti-scraping measures, for example: in some cases, they are necessary to protect a website from unethical scraping practices. However, cybercriminals can also use robust anti-scraping measures to shield themselves from threat intelligence.

AI also helps create more efficient honeypots: fake or erroneous data fed to an unsuspecting scraper, which corrupts the flow of data and can cause substantial losses for some businesses. Further, AI allows altering content, such as images, making it unscrapable or useless.