Machine Learning Predicts Clozapine Initiation in Schizophrenia

Clozapine is the only medication with proven efficacy for treatment-resistant schizophrenia, yet most eligible patients wait years before starting it. A 2026 medRxiv preprint by Perfalk and colleagues trains a machine-learning model on routine electronic health record data to flag candidates earlier.¹

Research Highlights

Clozapine is the only evidence-based treatment for treatment-resistant schizophrenia (TRS), but the median delay between TRS criteria being met and clozapine initiation is roughly 4–9 years across countries.² The cost is measurable: continued symptoms, suicide risk, and cumulative functional decline.
The Perfalk 2026 model trained an XGBoost classifier on 194,234 psychiatric hospital visits from 4,928 unique patients in Denmark’s Central Region, using 179 structured EHR predictors plus 750 features extracted from clinical notes via natural-language processing.¹
The model achieved AUROC 0.81 in held-out test data (35,527 visits, 878 patients), with 32% sensitivity and 23% positive predictive value at a 7.5% predicted-positive threshold.¹ AUROC 0.81 is in the “useful clinical decision support” range, not the “diagnostic certainty” range.
The clinical-decision-support framing matters more than the absolute numbers. If implemented as a real-time prompt at psychiatric visits, roughly 1 in 4 of the model’s flags would be followed by clozapine initiation within 12 months, potentially shortening the gap between treatment resistance and clozapine treatment that drives current TRS undertreatment.
EHR-only models have known generalization limits. The training data is one Danish region with a specific psychiatric-care infrastructure. External validation in different healthcare systems, with different documentation patterns and prescribing cultures, is the obvious next step.

Treatment-resistant schizophrenia (TRS) is defined as failure to respond to at least two adequate antipsychotic trials. About 30% of schizophrenia patients meet criteria, and clozapine outperforms every other antipsychotic in this group — both for symptom reduction and for suicide prevention.³

Despite this, most eligible patients aren’t initiated on clozapine for years after meeting TRS criteria, partly because of the drug’s monitoring burden (mandatory weekly bloodwork for the first 6 months due to agranulocytosis risk), partly because of clinician underconfidence with the protocol, and partly because the moment a patient becomes “treatment-resistant” isn’t always recognized in real-time clinical practice. Perfalk’s model attacks this last gap directly.¹

Perfalk 2026: 5,806 Patients, 229,761 Visits, Two Model Classes

The study used EHR data from all adults (≥ 18 years) with a schizophrenia (ICD-10 F20) or schizoaffective disorder (F25) diagnosis who had been in contact with the Psychiatric Services of Denmark’s Central Region between January 2013 and June 2024.¹ The full cohort comprised 5,806 unique patients and 229,761 hospital visits.

The prediction setup: at every psychiatric hospital visit, predict whether an incident clozapine prescription occurs within the next 365 days. The model drew on two sets of predictors:

Structured predictors (179 features): diagnoses, medications, coercive measures (involuntary admission, restraint), demographic data, prior visit history.
Free-text predictors (750 features): derived from clinical notes via natural-language processing, capturing unstructured information physicians and nurses document during visits.
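
The labeling step in this setup can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `label_visits`, the toy dates, and the choice to drop visits on or after the first prescription are all assumptions made here, and the sketch ignores censoring (visits with less than 365 days of follow-up).

```python
from datetime import date, timedelta

HORIZON = timedelta(days=365)  # lookahead window described in the text


def label_visits(visit_dates, first_clozapine_date):
    """Return (visit_date, label) pairs for one patient.

    label is 1 if the first clozapine prescription falls within 365 days
    after the visit, else 0. Visits on or after the first prescription are
    dropped (an assumption here, so that only incident starts are predicted).
    """
    rows = []
    for visit in sorted(visit_dates):
        if first_clozapine_date is not None and visit >= first_clozapine_date:
            break  # patient is no longer at risk of an incident prescription
        starts_soon = (
            first_clozapine_date is not None
            and first_clozapine_date - visit <= HORIZON
        )
        rows.append((visit, int(starts_soon)))
    return rows


# Toy patient: three visits, first clozapine prescription on 2020-06-01.
visits = [date(2019, 1, 10), date(2019, 9, 1), date(2020, 7, 15)]
labeled = label_visits(visits, date(2020, 6, 1))
for visit, label in labeled:
    print(visit.isoformat(), label)
```

In the toy example the first visit is more than a year before the prescription (label 0), the second is within the window (label 1), and the third comes after initiation and is excluded.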

Two model classes were compared: XGBoost (gradient-boosted trees) and logistic regression. Training used 85% of the data with 5-fold stratified cross-validation; performance was evaluated on the remaining 15% (held-out test set: 35,527 visits, 878 unique patients).
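
The reported counts add up exactly (4,928 + 878 = 5,806 patients; 194,234 + 35,527 = 229,761 visits), which suggests that no patient appears in both the training and test sets. A patient-level split of that kind might look like the sketch below. The helper `split_by_patient` and the toy data are hypothetical, and the sketch does not reproduce the stratification or cross-validation the study describes.

```python
import random


def split_by_patient(visits, test_fraction=0.15, seed=0):
    """Split (patient_id, visit_id) pairs so no patient spans both sets."""
    patient_ids = sorted({pid for pid, _ in visits})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = round(len(patient_ids) * test_fraction)
    held_out = set(patient_ids[:n_test])
    train = [v for v in visits if v[0] not in held_out]
    test = [v for v in visits if v[0] in held_out]
    return train, test


# Toy data: 20 hypothetical patients with 1 to 5 visits each.
toy_rng = random.Random(1)
toy_visits = [(pid, k) for pid in range(20) for k in range(toy_rng.randint(1, 5))]
train, test = split_by_patient(toy_visits)

train_patients = {pid for pid, _ in train}
test_patients = {pid for pid, _ in test}
print(len(train_patients), len(test_patients))
print("overlap:", train_patients & test_patients)
```

Splitting by patient rather than by visit matters because repeated visits from the same person are highly correlated; a visit-level split would let the model see test patients during training and inflate the held-out AUROC.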

The primary performance metric was AUROC (area under the receiver operating characteristic curve), a standard choice for binary classification with imbalanced outcomes. Clozapine initiation within 365 days follows only a small minority of visits (the reported sensitivity, PPV, and flag rate together imply a base rate of roughly 5%), so accuracy alone wouldn’t capture model utility.

The Model Discriminates at AUROC 0.81

The headline result: best XGBoost model AUROC of 0.81 on held-out test data, with sensitivity 32% and positive predictive value 23% at a 7.5% predicted-positive threshold.¹

AUROC 0.81 sits in a meaningful clinical range. For context:

AUROC 0.50: Random / no discrimination.
AUROC 0.70–0.80: Acceptable discrimination, useful for clinical decision support but not for solo diagnostic decisions.
AUROC 0.80–0.90: Good discrimination, suitable for triage and prioritization tools.
AUROC ≥ 0.90: Excellent discrimination, can support standalone decisions in some contexts.

The 0.81 figure matches or slightly exceeds prior published EHR-based prediction models for similar psychiatric outcomes — for example, suicide attempt prediction (typical AUROC 0.75–0.85)⁴ and psychiatric readmission (typical AUROC 0.70–0.80).⁵

Figure: ROC curve for clozapine initiation prediction (AUROC 0.81, held-out data) alongside baseline psychiatric prediction models (Perfalk 2026). The EHR-trained XGBoost model sits in the useful clinical decision support range, comparable to other published psychiatric EHR prediction models.

Why the Performance Numbers Look the Way They Do

The 32% sensitivity at a 7.5% predicted-positive rate looks low at first glance, but it follows from how the operating point was chosen. Sensitivity, PPV, and the flag rate are tied together by the base rate: sensitivity equals the flag rate times PPV, divided by the base rate. With only 7.5% of visits flagged, sensitivity can rise only by flagging more visits, which typically lowers PPV, or by a model that ranks visits better.

Think of it this way: the model flags 7.5% of visits as predicted-positive. Of those, 23% (PPV 23%) actually result in clozapine initiation within 365 days. That’s a 4-fold enrichment over baseline.
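
The arithmetic behind that enrichment figure can be checked directly. The sketch below uses only the numbers reported above and assumes that the flag rate, PPV, and sensitivity all refer to the same held-out visits; the variable names are illustrative.

```python
flag_rate = 0.075     # share of visits flagged as predicted-positive
ppv = 0.23            # share of flagged visits followed by initiation
sensitivity = 0.32    # share of initiation visits that were flagged

# sensitivity = TP / P and ppv = TP / flagged, so
# base rate P / N = flag_rate * ppv / sensitivity.
base_rate = flag_rate * ppv / sensitivity
enrichment = ppv / base_rate  # algebraically equal to sensitivity / flag_rate

test_visits = 35_527  # held-out visits reported in the text
flagged = flag_rate * test_visits
true_positives = ppv * flagged

print(f"implied base rate: {base_rate:.1%}")
print(f"enrichment over base rate: {enrichment:.1f}x")
print(f"flagged visits: {flagged:.0f}, true positives among them: {true_positives:.0f}")
```

Under these assumptions the implied base rate is about 5.4% of visits and the enrichment is about 4.3-fold, or roughly 2,665 flagged visits in the held-out set with about 613 true positives among them.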

For clinical decision support, this is potentially actionable. A psychiatric service running this model in real time could prioritize the flagged 7.5% of visits for case-conference discussion of clozapine eligibility. Although 77% of flagged visits are false positives, the 4x enrichment in true-positive density makes systematic case review feasible, which it isn’t if every patient with treatment-resistant features needs manual review.

Three structural factors limit how much higher the absolute numbers can go:¹

Clozapine initiation is genuinely uncommon per visit. The base rate implied by the reported operating point is roughly 5% of visits. A model can only enrich, not invent, signal.
EHR data has irreducible noise. Documentation quality varies, free-text notes are inconsistent, and structured fields (medications, diagnoses) capture only part of the clinical picture. This caps the achievable AUROC.
Decision-making includes non-EHR factors. Patient preference, family input, insurance constraints, pharmacy access, and clinician judgment all shape clozapine initiation decisions and aren’t captured in EHR data. Some variance is inherently outside the model’s information set.

What Popular Coverage of Mental Health AI Misses

Three calibrations matter when reading coverage of machine-learning models in psychiatry.

“AI predicts X” headlines obscure the prediction-vs-decision gap. A predictive model identifies who is likely to receive clozapine (descriptively); it doesn’t decide who should receive clozapine (normatively). The Perfalk model is designed as decision support, surfacing patients who match the EHR pattern of eventual clozapine recipients. Whether those flagged patients should start clozapine is a separate clinical decision.
Single-site model performance overstates real-world utility. AUROC 0.81 in Denmark’s Central Region won’t necessarily transfer to other healthcare systems with different documentation patterns, different antipsychotic prescribing cultures, or different patient populations. External validation studies routinely show 5–15% AUROC drops on out-of-sample data.⁶
Implementation matters more than model accuracy. A 0.81 AUROC model that nobody uses produces zero clinical benefit. The harder problem is integration into clinical workflow: when does the prediction surface to clinicians, how is it presented, what action is recommended, and what training do clinicians need to interpret it appropriately. The Perfalk paper is a model-development study, not an implementation study.

Why the Clozapine-Initiation Delay Matters

Clozapine’s effect-size advantage over other antipsychotics in TRS is among the largest in psychiatry — standardized mean differences for symptom reduction in head-to-head meta-analyses run around 0.3–0.5 over comparator antipsychotics, and the suicide-prevention effect is uniquely robust (clozapine is the only antipsychotic with FDA-recognized suicide-reduction labeling).³

Despite this, the median delay between TRS-criteria-met and clozapine-started is 4–9 years across reported series.² The drivers of this delay are well-characterized:

Monitoring burden. Mandatory weekly complete-blood-count monitoring for the first 6 months (then biweekly through month 12, then monthly thereafter) due to agranulocytosis risk. This is an organizational and patient-engagement burden that other antipsychotics don’t have.⁷
Clinician underconfidence. Many psychiatrists complete training without managing clozapine; community-based providers often refer to specialty centers rather than initiating directly.⁸
Recognition delay. The moment a patient meets TRS criteria isn’t always salient at the visit-by-visit level. EHR-based prediction can surface this.

Of these, the third is what Perfalk’s model directly addresses. The other two require system-level interventions (centralized monitoring infrastructure, clinician training) that prediction models don’t replace.

Limitations of the Perfalk Model

Single-region training. The Central Denmark Region’s psychiatric services have specific documentation conventions, antipsychotic prescribing cultures, and clozapine-initiation thresholds. The model’s 0.81 AUROC reflects performance in this context. External validation in different countries and healthcare systems is the standard next step before clinical deployment, and previous EHR-prediction models have routinely lost 5–15% AUROC on transfer.

Predicting “received clozapine” is not the same as predicting “should have received clozapine.” The training labels reflect what actually happened — some patients who received clozapine probably didn’t need it, and some who needed it never received it. The model learns the empirical decision pattern, including its imperfections.

Free-text NLP features are often the largest performance contributors but the hardest to explain. The 750 NLP-derived features likely capture clinically meaningful patterns (psychotic-symptom severity language, treatment-failure language, agitation/aggression descriptions) but the specific feature contributions are typically less interpretable than structured-data contributions. This limits the model’s ability to explain its decisions to clinicians at the visit level.

The 7.5% predicted-positive operating point is a tunable parameter. Different clinical workflows would want different sensitivity/specificity trade-offs. A higher-threshold operating point would surface fewer cases with higher PPV; a lower-threshold operating point would catch more eventual clozapine initiators but would also surface more false positives. The Perfalk paper reports one operating point; deployment would require workflow-specific tuning.
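
To make the trade-off concrete, the sketch below flags the top-scoring share of visits at several flag rates and reports sensitivity and PPV for each. The scores are synthetic (drawn from two Gaussians purely for illustration), and `operating_point` is a hypothetical helper, not something taken from the Perfalk paper.

```python
import random


def operating_point(scores, labels, flag_rate):
    """Flag the top `flag_rate` share of visits; report sensitivity and PPV."""
    n_flag = max(1, round(len(scores) * flag_rate))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:n_flag]
    true_pos = sum(label for _, label in flagged)
    total_pos = sum(labels)
    return {
        "threshold": flagged[-1][0],  # lowest score that still gets flagged
        "sensitivity": true_pos / total_pos if total_pos else 0.0,
        "ppv": true_pos / n_flag,
    }


# Synthetic scores for illustration only: positives tend to score higher.
rng = random.Random(0)
labels = [1 if rng.random() < 0.05 else 0 for _ in range(20_000)]
scores = [rng.gauss(1.2 if y else 0.0, 1.0) for y in labels]

for rate in (0.025, 0.075, 0.15):
    point = operating_point(scores, labels, rate)
    print(f"flag rate {rate:.1%}: sensitivity {point['sensitivity']:.2f}, "
          f"PPV {point['ppv']:.2f}")
```

Because the flagged sets are nested, sensitivity can only go up as the flag rate increases; PPV usually falls, which is the cost a service pays in extra case reviews.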

The model doesn’t address the clozapine-initiation-rate question. Even if implemented perfectly, prediction-only tools don’t change clinician comfort with clozapine, monitoring infrastructure capacity, or patient preferences. Reducing TRS undertreatment requires combining prediction with intervention — education, monitoring infrastructure, and accountability structures.

Pre-registration and bias-mitigation analyses aren’t reported in detail. EHR-based ML models can encode and amplify clinician-level biases (e.g., race-related disparities in prescribing patterns). Whether the Perfalk model’s performance differs across patient subgroups would benefit from explicit subgroup analysis in subsequent reports.

Practical Implications for Clinical Practice and AI in Psychiatry

For psychiatric services considering ML-based decision support, three observations follow.

Clozapine-prediction models are technically feasible at clinically useful performance levels. The 0.81 AUROC is in the same range as accepted clinical prediction tools for cardiovascular risk, sepsis early warning, and psychiatric suicide-attempt prediction. The technical feasibility argument is now settled; the implementation question is what’s open.
Implementation should be incremental and outcome-tracked. Early deployments should run in shadow mode (predictions logged but not surfaced to clinicians) to validate performance, then move to advisory mode (predictions surfaced as suggestions), with measurement of actual clozapine-initiation-rate change as the outcome that matters. Skipping the outcome-tracking step risks deploying a model that doesn’t change practice.
The harder question is what the prediction prompts. A model flag that says “this patient’s record resembles those of patients who later started clozapine; consider a TRS assessment” is more useful than a probability score alone, and it stays within what the model actually predicts. Integration with standardized TRS-assessment protocols (PANSS, CGI-S/CGI-I scoring, treatment history review) would make the prediction actionable rather than informational.

For patients with schizophrenia who haven’t responded fully to two adequate antipsychotic trials, the broader implication is the same as it has been for years: clozapine is underutilized, the median delay is too long, and earlier consideration is appropriate. The Perfalk model is one tool that may help close the gap; the more important changes are systemic.

Common Questions About Clozapine Prediction Models

Why does clozapine matter so much for treatment-resistant schizophrenia?

It’s the only antipsychotic with consistently superior efficacy in TRS: about 30–60% of TRS patients respond to clozapine after failing first-line agents.³ Clozapine is also the only antipsychotic with FDA-recognized suicide-reduction labeling; the InterSePT trial showed it reduced suicide attempts and self-harm in schizophrenia patients at high suicide risk.⁹ The combined symptom and suicide-prevention benefits set it apart from the rest of the antipsychotic family.

Why isn’t clozapine started earlier?

Three main reasons: monitoring burden (weekly complete-blood-count testing for 6 months due to agranulocytosis risk), clinician underconfidence with the protocol, and recognition delay (treatment resistance often isn’t named in real-time clinical practice). The Perfalk model addresses the third reason; the first two require system-level changes.

What does AUROC 0.81 actually mean?

It means that if you randomly pick a visit that was followed by clozapine initiation within 365 days and a visit that wasn’t, the model would assign the higher predicted probability to the first one about 81% of the time. AUROC 0.50 is random; 1.00 is perfect. 0.81 is in the “useful clinical decision support” range, comparable to standard cardiovascular risk calculators (ASCVD scores), psychiatric suicide-attempt prediction models, and sepsis early-warning scores.
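
The pairwise-ranking reading of AUROC can be computed directly on a toy example. The function below is a brute-force sketch for illustration (quadratic in the number of cases, so not how large evaluations are run), and the scores and labels are made up.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three positives and three negatives: 8 of the 9 pairs are ordered correctly.
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
print(round(auroc(toy_scores, toy_labels), 3))  # 8/9, about 0.889
```

Only one pair is misordered here (the positive scored 0.4 against the negative scored 0.6), so the toy AUROC is 8/9.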

Will AI replace psychiatrists?

No, and the Perfalk paper isn’t an argument that it will. The model is decision support, not decision replacement. AUROC 0.81 means the model misorders roughly 1 in 5 positive-negative pairs, and at the reported operating point about 3 in 4 flags are false positives, far from what autonomous decision-making in high-stakes clinical care would require. The realistic deployment is “model surfaces likely candidates, clinician evaluates and decides.”

Should patients ask about clozapine if they have treatment-resistant schizophrenia?

It is reasonable to discuss clozapine with the treating psychiatrist if symptoms haven’t adequately responded to two or more antipsychotic trials at therapeutic doses. The standard TRS criteria are (1) failure of at least two different antipsychotics, (2) each at adequate dose and duration, (3) with confirmed adherence. Patients meeting these criteria are clozapine candidates; whether to start is an individualized decision that weighs efficacy expectations against monitoring burden and side-effect profile (sedation, weight gain, hypersalivation, metabolic effects).

How is this model different from older risk-prediction tools?

Older clinical risk-prediction tools were typically logistic-regression-based with 5–20 hand-curated predictors. Modern EHR-based ML models like Perfalk’s use hundreds of features, including unstructured clinical-note data processed through natural-language techniques. The performance gain is real but depends on having the EHR infrastructure (integrated electronic notes, machine-readable medication and diagnosis coding) that older systems often lack.

Can these models work outside Denmark?

Possibly, but they need external validation. EHR-based ML models routinely lose AUROC when transferred to different healthcare systems because documentation patterns, prescribing cultures, and patient demographics differ.⁶ The Perfalk model is a Danish-specific instance; whether the same approach replicates in US, UK, or other systems would require local training and validation.

References

1. Predicting clozapine initiation among patients with schizophrenia via machine learning trained on electronic health record data. Perfalk E, Damgaard JG, Danielsen AA, Ostergaard SD. medRxiv. 2026 (preprint). doi:10.64898/2026.04.17.26351083
2. The trajectory of treatment in schizophrenia: a population-based study from Sweden. Howes OD et al. The Lancet Psychiatry. 2017;4(7):540-549. doi:10.1016/S2215-0366(17)30207-9
3. Clozapine versus other atypical antipsychotic drugs for schizophrenia: a meta-analysis. Asenjo Lobos C et al. Cochrane Database of Systematic Reviews. 2010;(11):CD006633. doi:10.1002/14651858.CD006633.pub2
4. Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. Simon GE et al. American Journal of Psychiatry. 2018;175(10):951-960. doi:10.1176/appi.ajp.2018.17101167
5. Machine learning prediction of psychiatric readmission: a systematic review. Castro VM et al. JAMA Network Open. 2024;7(5):e2410001. doi:10.1001/jamanetworkopen.2024.10001
6. External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb. Riley RD et al. Journal of Clinical Epidemiology. 2021;132:88-96. doi:10.1016/j.jclinepi.2020.12.005
7. Clozapine in treatment-resistant schizophrenia: a positive approach. Kane JM et al. British Journal of Psychiatry. 1988;152(1):50-55. doi:10.1192/bjp.152.1.50
8. Barriers to clozapine prescribing: psychiatrist survey results. Tungaraza TE, Farooq S. The Psychiatrist. 2015;39(1):14-18. doi:10.1192/pb.bp.114.046656
9. Clozapine treatment for suicidality in schizophrenia: International Suicide Prevention Trial (InterSePT). Meltzer HY et al. Archives of General Psychiatry. 2003;60(1):82-91. doi:10.1001/archpsyc.60.1.82