A remarkably high percentage of clinical trials fail every year, leading to high attrition rates of molecules from the treatment development system.

These rates depend greatly upon the therapeutic area, type of molecule, and lead/non-lead indication status, but most stakeholders agree that overall failure rates remain unacceptably high (Kola & Landis, 2004; Hay et al., 2014; DiMasi et al., 2016). Due to the nature of the treatment approval process, failure is both inevitable and costly. However, significant questions about clinical trial failure remain: can we identify failure early, and which factors driving failure could or should be modified?


These questions are critical because failure and attrition in the treatment development system result in substantial financial and human costs that affect every American.

As a result, stakeholders search constantly for the factors driving clinical trial failure and for strategies to address those factors. Given these substantial costs, we set out to identify the specific factors driving clinical trial success and failure. Our goal was to build a model that accurately predicts which trials are more likely to fail, both to demonstrate that elevated risk of failure can be detected earlier in the treatment development cycle and to help prevent costly financial and human losses.


We developed a machine learning model, a random forest (RF), using three sources of data (see box) to predict whether a treatment will successfully transition from one phase to the next (e.g., from Phase 2 to Phase 3, or from Phase 3 to regulatory approval)1. Machine learning models attempt to “learn” the structure of large, complex data sets and to find nonlinear patterns that are hard to detect using traditional statistical techniques.


  • Aggregated Analysis of Clinical Trials (AACT) (a publicly available aggregation of
  • BioMedTracker (proprietary data from Informa Business Intelligence),
  • BioMedTracker Drug Search Database.
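
To make the modeling approach concrete, the following is a minimal sketch of a random forest that predicts phase transition and reports held-out accuracy and feature importances. It is an illustration under stated assumptions, not the model used in this study: the data are synthetic, the feature names (`n_eligibility_criteria`, `n_endpoints`, `enrollment`) and the relationship between features and outcome are hypothetical, and scikit-learn is assumed as the library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; none of these values come from AACT or BioMedTracker.
rng = np.random.default_rng(0)
n_trials = 2000
feature_names = ["n_eligibility_criteria", "n_endpoints", "enrollment"]  # hypothetical
X = np.column_stack([
    rng.integers(5, 60, n_trials),     # number of eligibility criteria
    rng.integers(1, 25, n_trials),     # number of endpoints
    rng.integers(20, 2000, n_trials),  # planned enrollment
])

# Assumed toy relationship: more criteria and more endpoints lower the
# probability that a trial advances to the next phase.
logit = 2.0 - 0.04 * X[:, 0] - 0.10 * X[:, 1]
p_advance = 1.0 / (1.0 + np.exp(-logit))
y = (rng.random(n_trials) < p_advance).astype(int)  # 1 = advanced, 0 = failed

# Hold out a test set so accuracy is measured on trials the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
importances = dict(zip(feature_names, model.feature_importances_))

print(f"held-out accuracy: {accuracy:.2f}")
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

The impurity-based importances printed here indicate how much each feature contributes to the forest's splits; they do not by themselves show the direction of an effect or establish causation.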


Our machine learning model identified which protocol and operational characteristics are important to the success and failure of trials. First, an increase in the number of eligibility criteria, the section of the protocol that describes which patients are eligible and ineligible for a trial, reduces the likelihood of regulatory approval. Second, an increase in the number of endpoints, the events or outcomes measured to determine whether the treatment being studied is beneficial, also reduces the likelihood of regulatory approval.

1 We recognize that this measure of “success” does not get at the “value” of the drug (i.e., patient outcomes) after regulatory approval; however, such measures are not yet available for use in larger analyses such as ours.

For example, the addition of one endpoint to a Phase 2 oncology trial decreases the odds of regulatory approval by 10%.
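
Because the 10% figure is a change in odds rather than in probability, it can help to translate it into a probability for a given baseline. The sketch below does that arithmetic. The 20% baseline approval probability is a hypothetical value chosen only for illustration and does not come from our data; the function name is likewise illustrative.

```python
def apply_odds_change(baseline_prob: float, odds_multiplier: float) -> float:
    """Return the probability after multiplying the baseline odds by a factor."""
    odds = baseline_prob / (1.0 - baseline_prob)  # probability -> odds
    new_odds = odds * odds_multiplier             # e.g. 0.90 for a 10% decrease
    return new_odds / (1.0 + new_odds)            # odds -> probability


baseline = 0.20  # hypothetical baseline probability of approval
new_prob = apply_odds_change(baseline, 0.90)
print(f"{baseline:.3f} -> {new_prob:.3f}")  # prints "0.200 -> 0.184"
```

Under this assumed baseline, a 10% decrease in odds corresponds to a drop of roughly 1.6 percentage points in the probability of approval; the size of the drop depends on the baseline chosen.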


Both of these findings reflect many conversations we have had with stakeholders and experts: more complex trial designs often lead to failure. There is a vicious cycle in treatment development: when a disease or molecule is less well understood, the clinical trial protocol grows in complexity. In other words, when there is more uncertainty, sponsors add eligibility criteria and endpoints to manage that lack of understanding. But our analysis, as well as the findings of other researchers, indicates that this strategy leads to failure.

Due to limitations in our dataset, we cannot prove whether our model predicts failure solely as a result of complex trial design or whether a number of covariates are also involved, e.g., molecule characteristics, disease characteristics, or a lack of understanding of regulatory barriers and human resources. Our model predicts outcome with an accuracy of 80%. So while we know trials fail for many reasons, including reasons we were unable to train our model on, we believe it is still possible to identify which trials are inherently more prone to failure before they even open to enrollment. Identifying these trials earlier in the treatment development cycle, by using models like ours during the facilitated review process, can give sponsors more information when they face difficult decisions about whether these higher-risk trials should be modified or halted.


Kola, I., & Landis, J. (2004). Opinion: Can the pharmaceutical industry reduce attrition rates? Nature Reviews Drug Discovery, 3(8), 711.

DiMasi, J. A., Grabowski, H. G., & Hansen, R. W. (2016). Innovation in the pharmaceutical industry: new estimates of R&D costs. Journal of Health Economics, 47, 20-33.

Grignolo, A., & Pretorius III, S. (2016). Phase III trial failures: costly, but preventable. Applied Clinical Trials, 25(8).

Harrison, R. K. (2016). Phase II and phase III failures: 2013-2015. Nature Reviews Drug Discovery, 15(12), 817-818.

Hay, M., Thomas, D. W., Craighead, J. L., Economides, C., & Rosenthal, J. (2014). Clinical development success rates for investigational drugs. Nature Biotechnology, 32(1), 40-51.

Co-Research Leads:

Sauleh Siddiqui
Jen Bernstein