AI agents, autonomous systems designed to perform specific tasks for a business or individual, are on the rise. In fact, eMarketer reports that 51% of organizations are exploring how to use AI agents, and another 37% are already piloting them. While they can be trained to perform a wide range of tasks, AI agents are ultimately only as good as the data used to train them.
Generally speaking, acquiring data from the open web is considered the best way to train AI agents. As Or Lenchner, CEO of Bright Data, explains, there are several best practices that every organization should follow when acquiring and using public web data for this purpose.
1. Real-time collection is essential for AI agents
According to Lenchner, one of the biggest differences between AI agents and LLMs is that agents operate in a real-time environment. Unlike LLMs, they can’t rely solely on historical data: continuous updates are essential for an AI agent to make accurate decisions and adapt to changing conditions.
“AI agents require real-time or near-real-time data to remain effective. Batch processing isn’t enough for AI-driven applications. Parallelized scraping, caching mechanisms and API-based data feeds ensure that AI models are always operating on the most current and relevant information.”
The need for real-time data can make all the difference in activities like booking travel or providing financial information. Processes that keep AI agents continuously up to date must therefore be a top priority in any data collection strategy.
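As a rough illustration of the pattern Lenchner describes, here is a minimal Python sketch that combines parallelized fetching with a short-lived cache so an agent always works from recent data. The URLs, worker count and 60-second freshness window are placeholder assumptions for the example, not recommendations from Bright Data.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

CACHE_TTL_SECONDS = 60  # hypothetical freshness window for "near-real-time"
_cache: dict[str, tuple[float, bytes]] = {}

def fetch_fresh(url: str) -> bytes:
    """Return cached content if it is still fresh, otherwise re-fetch."""
    cached = _cache.get(url)
    if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    with urlopen(url, timeout=10) as resp:
        body = resp.read()
    _cache[url] = (time.time(), body)
    return body

def fetch_all(urls: list[str]) -> list[bytes]:
    """Fetch many sources in parallel so no single slow site stalls the feed."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch_fresh, urls))

if __name__ == "__main__":
    pages = fetch_all(["https://example.com"])  # placeholder source
    print(len(pages[0]), "bytes fetched")
```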
2. Develop a data pipeline
Because AI agents require a continuous flow of real-time data, developers must build systems that make this possible. As Lenchner explains, “The biggest challenge is establishing a data pipeline that’s scalable, processed and dynamic. Many developers struggle working with fluid data sources, JavaScript-heavy sites and anti-bot tactics. The best developers will employ tools that handle complex site structures, automate retrieval and integrate smoothly into their AI models.”
With a streamlined data pipeline in place, businesses can have greater confidence that their AI agents will stay current, even in industries like finance and e-commerce, where conditions can change quickly.
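One way to structure such a pipeline is as a chain of small, testable stages. The sketch below is a generic fetch-parse-load skeleton with simple retry logic, assuming a plain-HTML source; handling the JavaScript-heavy sites and anti-bot tactics Lenchner mentions would require a headless browser or a dedicated scraping service, which is beyond this example.

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

def fetch(url: str, retries: int = 3) -> str:
    """Download a page, retrying with exponential backoff on transient failures."""
    for attempt in range(retries):
        try:
            with urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except URLError:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"could not fetch {url}")

def parse(html: str) -> dict:
    """Extract fields of interest; a real parser would use an HTML library."""
    return {"length": len(html), "fetched_at": time.time()}

def load(record: dict) -> None:
    """Hand the record to the AI agent's data store (stdout here)."""
    print(json.dumps(record))

def run_pipeline(urls: list[str]) -> None:
    for url in urls:
        load(parse(fetch(url)))

if __name__ == "__main__":
    run_pipeline(["https://example.com"])  # placeholder source
```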
3. Focus on quality sources
When building a data pipeline, developers must also place heavy emphasis on the sources they use to acquire data. In an age of widespread digital disinformation, the last thing any organization needs is for an AI agent to become corrupted by low-quality data.
“If you don’t put high-quality data into your AI model, it won’t be high-quality either. First, you have to collect data from trusted, authoritative sources in order to reduce misinformation and bias,” Lenchner says.
“Next, automated filtering and deduplication are essential. AI models can be skewed by duplicate, outdated or irrelevant data. Adopting automated processes that clean your data will ensure cohesion, and adding a layer of human supervision into the mix will ensure that your data is accurate and contextually meaningful.”
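In practice, that kind of filtering can start very simply: keep a record only if it comes from a trusted source and has not been seen before. In the sketch below, the domain allowlist and the choice of SHA-256 content hashing are illustrative assumptions, not a prescribed setup.

```python
import hashlib
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"example.com", "example.org"}  # hypothetical allowlist
_seen_hashes: set[str] = set()

def is_trusted(url: str) -> bool:
    """Keep only allowlisted sources to reduce misinformation and bias."""
    return urlparse(url).hostname in TRUSTED_DOMAINS

def is_duplicate(text: str) -> bool:
    """Drop records whose content hash has already been ingested."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False

def filter_records(records: list[tuple[str, str]]) -> list[str]:
    """records is a list of (source_url, text) pairs."""
    return [text for url, text in records
            if is_trusted(url) and not is_duplicate(text)]

if __name__ == "__main__":
    sample = [("https://example.com/a", "same text"),
              ("https://example.com/b", "same text"),    # duplicate content
              ("https://untrusted.biz/c", "other text")]  # not on allowlist
    print(filter_records(sample))  # -> ['same text']
```

Automated checks like these catch the bulk of the noise; the human supervision Lenchner describes then reviews what remains for accuracy and context.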
4. Create a system for providing clean data
Even when data is acquired from quality sources, developers must use systems that clean it for their AI agents. Without proper oversight, it is all too easy to feed an AI agent poorly structured or inconsistent data, even when drawing only from trustworthy sources.
As Lenchner explains, “Raw web data can be messy, outdated and unstructured, which could lead to poor AI model performance. Establishing a workflow with automated deduplication, normalization and schema mapping ensures that models receive clean, usable data. AI models trained on inconsistent or outdated data will struggle to generate accurate outputs.”
Clean data will ensure that AI agents are accurate, reliable and ultimately equipped to deliver their intended functionality.
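As one illustration of the normalization and schema mapping Lenchner mentions, the sketch below maps differently shaped raw records onto a single canonical schema and smooths out obvious inconsistencies. The field names and the two source formats are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Product:
    """Hypothetical canonical schema every raw record is mapped onto."""
    name: str
    price_usd: float

def normalize_price(value) -> float:
    """Coerce '19.99', '$19.99', or 19.99 into a plain float."""
    if isinstance(value, str):
        value = value.strip().lstrip("$").replace(",", "")
    return float(value)

# One mapping per source: canonical field -> raw field name.
SCHEMA_MAPS = {
    "store_a": {"name": "title", "price_usd": "price"},
    "store_b": {"name": "product_name", "price_usd": "cost_usd"},
}

def to_canonical(source: str, raw: dict) -> Product:
    """Map a raw record from a known source onto the canonical schema."""
    mapping = SCHEMA_MAPS[source]
    return Product(
        name=raw[mapping["name"]].strip(),
        price_usd=normalize_price(raw[mapping["price_usd"]]),
    )

if __name__ == "__main__":
    # Two differently shaped records normalize to the same canonical form.
    print(to_canonical("store_a", {"title": " Widget ", "price": "$19.99"}))
    print(to_canonical("store_b", {"product_name": "Widget", "cost_usd": 19.99}))
```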
5. Be aware of privacy concerns
Privacy has become a major concern with AI, particularly regarding how developers collect the data AI agents will use. Accusations that organizations have used private messages to train AI highlight both the depth of public concern and the potential repercussions of careless data collection.
“The most important consideration to make is that only publicly available web data is being collected, and that it’s being collected in a responsible and ethical fashion,” Lenchner advises.
“Data teams must pay close attention to website terms of service and global data privacy regulations, such as GDPR and CCPA. And when that data has been collected, companies are expected to handle it in adherence to ethical AI development. Organizations are also expected to be fully transparent about how they access and use public data.”
Data teams can streamline this work by implementing protocols that keep data collection from running afoul of privacy regulations. This extra step is critical for maintaining an organization’s reputation and avoiding PR catastrophes.
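As a small example of what such a protocol can look like in code, the sketch below checks a site’s robots.txt before fetching and redacts email addresses from collected text before storage. Real GDPR and CCPA compliance involves far more than this; the regex and the user-agent string are placeholder assumptions for illustration only.

```python
import re
from urllib import robotparser
from urllib.parse import urlparse

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # rough email pattern

def allowed_by_robots(url: str, user_agent: str = "example-bot") -> bool:
    """Respect the site's robots.txt before collecting anything."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def redact_pii(text: str) -> str:
    """Strip obvious personal identifiers before the data is stored."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

if __name__ == "__main__":
    if allowed_by_robots("https://example.com/page"):
        print(redact_pii("Contact jane.doe@example.com for details."))
```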
Train AI agents the right way
As Lenchner’s insights reveal, properly training AI agents requires a clear set of processes for sourcing, collecting and cleaning data. By putting that framework in place and keeping humans in the loop, developers can ensure that their AI agents deliver dependable support and fulfill their intended purposes.
