AI in pharmaceuticals is set to transform drug discovery, clinical trials, manufacturing, and marketing by analysing vast datasets to speed up processes, reduce costs, and enable personalised medicine.
Applications being worked on range from identifying drug candidates and predicting protein structures to optimising supply chains and automating regulatory tasks, though challenges like data quality and transparency remain. There are examples where AI has helped find new targets, design molecules faster, recruit trial patients better, and create tailored treatments, making drug development more efficient and precise.
Yet how does this technological revolution fit with the regulators who oversee the pharmaceutical sector at national and supranational levels?
The European Medicines Agency (EMA) has become the first pharmaceutical regulator to produce draft guidance on the use of artificial intelligence in the development and manufacture of medicinal products. This comes at an important juncture, as the industry weighs the benefits of AI against the errors it can introduce.
Termed Annex 22, the draft document is a new regulatory annex focused on the governance, validation, and oversight of AI/ML systems used in Good Manufacturing Practice (GMP) environments. The draft text closely complements Annex 11 (which is in place for computerised systems), and together the two documents are designed to prevent unsafe use of adaptive or opaque models in critical GxP processes.
As to what Annex 22 contains, my assessment is:
Scope – very strict boundaries
Annex 22 applies only to static, deterministic AI/ML models used in critical GMP processes. This means that the following are permitted: static machine learning models; deterministic models (same input → same output); and use in critical applications, but only under strict controls.
Explicitly excluded are dynamic / self‑learning models; probabilistic models; generative AI and Large Language Models (LLMs). The Annex specifically states that generative AI/LLM use is only acceptable for non‑critical GMP tasks with Human-in-the-Loop (HITL) oversight. HITL is an AI and machine learning approach where human interaction and intelligence are integrated into the system’s training, testing, and operational cycles.
This is a very high bar and one that disqualifies many commercial AI tools, unless they are configured to be heavily constrained.
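The "same input → same output" property can be spot-checked before a model is considered for a critical application. The following is a minimal sketch, not a method prescribed by the Annex: `is_deterministic`, `frozen_model`, and the repeat count are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterable, Sequence


def is_deterministic(
    predict: Callable[[Sequence[float]], object],
    inputs: Iterable[Sequence[float]],
    repeats: int = 5,
) -> bool:
    """Return True if every input yields an identical output on repeated calls.

    `predict` is a hypothetical stand-in for a frozen (static) model.
    """
    for x in inputs:
        first = predict(x)
        # Any differing repeat means the model is not deterministic for x.
        if any(predict(x) != first for _ in range(repeats - 1)):
            return False
    return True


def frozen_model(x: Sequence[float]) -> str:
    # Hypothetical static rule: fixed weights, no randomness, no online updates.
    score = 0.8 * x[0] - 0.3 * x[1]
    return "reject" if score > 0.5 else "accept"
```

Passing such a check is necessary but not sufficient: it shows repeatability on the inputs tried, not that the model is frozen against later retraining.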

Heavy emphasis on cross‑functional accountability
The Annex mandates that all subject matter experts, data scientists, Quality Assurance (QA), IT, and vendors must collaborate from algorithm selection to operation. In order to chart this process, clear documentation is required regardless of whether the model is built in‑house or by suppliers. For this, quality risk management needs to underpin all decisions.
Furthermore, each pharmaceutical organisation using AI needs to develop and put in place a strong governance framework for AI.
Intended use – must be extremely well defined
Acceptance testing in pharmaceuticals consists of formal, documented, and GMP-compliant validation of equipment, specifically through Factory Acceptance Testing (FAT) and Site Acceptance Testing (SAT). FAT verifies equipment at the vendor’s site before shipping, while SAT confirms functional, integrated performance in the final operating environment.
In relation to this, the Annex indicates that before acceptance testing commences there needs to be a full characterisation of the input sample space, including the identification of rare variations. To achieve this, subgroups must be identified (e.g., site, equipment, defect type) and HITL responsibilities must be explicitly defined and monitored.
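Characterising the input sample space includes checking that every identified subgroup is actually represented in the available data. A minimal sketch, assuming each sample carries a subgroup label; the function name `coverage_gaps`, the defect types, and the minimum count are hypothetical.

```python
from collections import Counter
from typing import Iterable


def coverage_gaps(
    labels: Iterable[str], expected: Iterable[str], minimum: int
) -> dict[str, int]:
    """Return the expected subgroups that have fewer than `minimum` samples."""
    counts = Counter(labels)
    # Counter returns 0 for subgroups that never occur, so missing ones are flagged too.
    return {group: counts[group] for group in expected if counts[group] < minimum}


# Hypothetical defect types for a visual inspection model.
sample_labels = ["scratch"] * 5 + ["chip"]
expected_groups = ["scratch", "chip", "discolouration"]
gaps = coverage_gaps(sample_labels, expected_groups, minimum=3)
```

Here `gaps` lists the rare or absent subgroups that would need more samples before acceptance testing starts.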
Acceptance criteria – statistical expectations
To assess the success of AI, the Annex requires:
- Clear test metrics (accuracy, sensitivity, etc.).
- Acceptance criteria set by experts before testing begins.
- The AI model's performance must be superior to that of the process it replaces.
This presupposes that the current manual/automated process intended to be replaced by AI has known, documented performance metrics.
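The relationship between the documented baseline, the predefined acceptance limit, and the observed result can be sketched as follows. This is an illustration under stated assumptions: the `Criterion` class, the `evaluate` function, and all figures are hypothetical, and "superior" is read here as an acceptance limit set strictly above the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    metric: str
    baseline: float   # documented performance of the process being replaced
    threshold: float  # acceptance limit fixed by experts before testing begins


def evaluate(criteria: list[Criterion], observed: dict[str, float]) -> dict[str, bool]:
    """Compare observed test metrics against the predefined acceptance limits."""
    results = {}
    for c in criteria:
        # A limit at or below the baseline would not demonstrate superiority.
        if c.threshold <= c.baseline:
            raise ValueError(f"{c.metric}: threshold must exceed the baseline")
        results[c.metric] = observed[c.metric] >= c.threshold
    return results


# Hypothetical figures, for illustration only.
criteria = [
    Criterion("accuracy", baseline=0.95, threshold=0.97),
    Criterion("sensitivity", baseline=0.92, threshold=0.95),
]
outcome = evaluate(criteria, {"accuracy": 0.981, "sensitivity": 0.94})
```

In this example the model meets the accuracy limit but misses the sensitivity limit, so it would not pass overall.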

Test data – high statistical and procedural rigour
Test data used to assess the AI must represent the entire input space (including rare edge cases). The Annex also calls for the data set to be large enough for statistical significance and to be labelled with extremely high accuracy.
Interestingly, the Annex also states that users assessing an AI model must avoid test data created by generative AI.
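One way to judge whether a test set is large enough is to look at the confidence interval around an observed metric. The sketch below uses the Wilson score interval, a standard statistical method that the Annex does not itself prescribe; the function name and the sample counts are illustrative assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Same observed accuracy (98%), two hypothetical test set sizes.
small = wilson_interval(98, 100)
large = wilson_interval(980, 1000)
```

The interval from the larger set is markedly narrower, so the same observed accuracy supports a much stronger claim about true performance.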
Test data independence – strong separation-of-duties
To ensure the AI development process remains free from bias, the Annex has put in place a series of controls. These include:
- No shared use of training and test data (to ensure that data remains free from contamination).
- Access‑controlled, audited repositories.
- Developers must never access test data.
- Staff who have seen test data cannot train the same model unless under 4‑eyes control.
- Physical objects used for testing cannot be reused for training.
Hence, this requirement enforces strict data segregation.
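A simple technical check can support the procedural controls listed above by detecting records that appear in both the training and test sets. This is a minimal sketch, assuming records are available as raw bytes; `fingerprint` and `find_leakage` are hypothetical helper names.

```python
import hashlib


def fingerprint(record: bytes) -> str:
    """SHA-256 digest used as a content fingerprint for one data record."""
    return hashlib.sha256(record).hexdigest()


def find_leakage(training: list[bytes], test: list[bytes]) -> set[str]:
    """Return fingerprints of records present in both training and test data."""
    train_hashes = {fingerprint(r) for r in training}
    return {fingerprint(r) for r in test} & train_hashes


# Hypothetical record contents standing in for image files.
leaked = find_leakage([b"img-001", b"img-002"], [b"img-003", b"img-002"])
```

Hashing only finds byte-identical duplicates. Near-duplicates, such as a second photograph of the same physical object, still need the procedural segregation described above.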
Test execution
To test out the suitability of the AI, the Annex requires the following:
- Demonstrating generalisation (no over/underfitting).
- A fully predefined test plan with metrics, test scripts, and data references.
- Deviation handling identical to standard GMP deviation processes.
- Retention of all test artefacts including audit trails and physical test objects.
Explainability – mandatory in critical applications
Each AI model must provide feature attributions. These are explainable AI (sometimes abbreviated to XAI) techniques that assign importance scores to input features, quantifying their influence on a machine learning model’s prediction. These methods help determine how specific inputs, like pharmaceutical product yield, drive model behaviour when making predictions. The intention is to offer insights into model transparency and decision-making.

SHAP and LIME are popular, model-agnostic techniques for demonstrating ‘explainability’. Both are used to understand machine learning model predictions, differing mainly in their approach:
- LIME (Local Interpretable Model-agnostic Explanations) builds simple, local linear models around specific predictions.
- SHAP (SHapley Additive exPlanations) uses game theory (Shapley values) to produce more robust, mathematically grounded feature attributions, offering both local and global insights.
AI “black boxes” are called out as being unacceptable in GMP environments.
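To make the idea of a Shapley-based attribution concrete, the sketch below computes exact Shapley values for a tiny model by enumerating every feature subset, with absent features replaced by a baseline value. It is not the API of the SHAP library, which uses faster approximations; `shapley_values`, `toy_model`, and the input figures are hypothetical, and the enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> list[float]:
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(present: frozenset) -> float:
        # Features in `present` take their real value, the rest the baseline value.
        mixed = [x[j] if j in present else baseline[j] for j in range(n)]
        return model(mixed)

    attributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (value(s | {i}) - value(s))
        attributions.append(phi)
    return attributions


def toy_model(features: Sequence[float]) -> float:
    # Hypothetical linear model with fixed weights.
    return 2.0 * features[0] - 1.0 * features[1] + 0.5 * features[2]


phi = shapley_values(toy_model, x=[3.0, 1.0, 4.0], baseline=[1.0, 1.0, 0.0])
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that gives the method its name.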
Confidence – controls against uncertain predictions
To build confidence that the AI model is doing what it is intended to do, each model must:
- Log confidence scores.
- Employ thresholds to avoid unreliable outputs.
- Output “undecided” where confidence is low.
These features are seen as preventing inappropriate automated decisions from occurring.
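The three controls above can be combined in a small gating step that sits between the model and any automated action. A minimal sketch, assuming the model exposes per-class confidence scores; the `Decision` class, the `gate` function, the logger name, and the 0.90 threshold are hypothetical.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("model_confidence")


@dataclass(frozen=True)
class Decision:
    label: str
    confidence: float


def gate(scores: dict[str, float], threshold: float = 0.90) -> Decision:
    """Return the top class, or 'undecided' when its confidence is below threshold."""
    label, confidence = max(scores.items(), key=lambda item: item[1])
    # Every score is logged, whether or not the prediction is accepted.
    logger.info("top class=%s confidence=%.3f threshold=%.2f", label, confidence, threshold)
    if confidence < threshold:
        return Decision("undecided", confidence)
    return Decision(label, confidence)
```

An "undecided" result would typically be routed to a human reviewer rather than acted on automatically.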
Operation – strict life‑cycle governance
To ensure that the AI model operates as intended across its lifecycle, the Annex requires that each change is documented and assessed, and that configuration controls are in place to detect unauthorised changes.
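One common way to detect an unauthorised change is to compare a cryptographic digest of the deployed model artefact against the digest recorded at approval. The sketch below illustrates the idea only; `artefact_digest`, `verify_release`, and the byte strings are hypothetical, and the Annex does not mandate any particular mechanism.

```python
import hashlib
import hmac


def artefact_digest(data: bytes) -> str:
    """SHA-256 digest of a model artefact (weights, configuration, and so on)."""
    return hashlib.sha256(data).hexdigest()


def verify_release(data: bytes, approved_digest: str) -> bool:
    """Return True only if the artefact matches the digest recorded at approval."""
    return hmac.compare_digest(artefact_digest(data), approved_digest)


# Hypothetical artefact contents; in practice these would be read from file.
approved = artefact_digest(b"weights-v1")
unchanged = verify_release(b"weights-v1", approved)
tampered = verify_release(b"weights-v1-modified", approved)
```

A mismatch does not say what changed, only that the artefact differs from the approved version, which is the trigger for the documented change assessment.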
AI is likely to increase in pace and application within pharmaceuticals. The draft Annex provides some clarity as to what will be expected by pharmaceutical regulators within the European Union. The Annex text has recently closed for public comment and a finalised version is expected to be issued later in 2026.
