Defining + Decrypting Quality Data in Health Care AI with Recursion
Guest: Najat Khan, Ph.D., Chief R&D Officer and Chief Commercial Officer, Recursion
Host: Brendan Smith, Director, Life Science & Diagnostic Tools and Biotech Analyst, TD Cowen
The TD Cowen Machine Medicine: AI & Health Care podcast series takes on the cross-sector artificial intelligence revolution and breaks it down one piece at a time. We highlight some of the biggest misconceptions about how AI is and can be used in different health care settings and aim to contextualize the latest and most impactful developments in a field defined by rapid innovation.
In this episode, TD Cowen Health Care Analyst Brendan Smith hosts Najat Khan, Ph.D., the Chief R&D Officer and Chief Commercial Officer of Recursion, to pull back the curtain on one of the largest datasets in AI drug discovery and help us define what "quality data" actually looks like. We discuss how the data you gather, the data you generate and the data you ultimately use to train your AI model play such a critical role in its differentiation and future success.
This podcast was originally recorded on April 11, 2025.
Speaker 1:
Welcome to TD Cowen Insights, a space that brings leading thinkers together to share insights and ideas shaping the world around us. Join us as we converse with the top minds who are influencing our global sectors.
Brendan Smith:
Welcome back to another episode of Machine Medicine: AI & Health Care, TD Cowen's podcast series where we bring you the latest and most important takeaways from the state of AI and the healthcare sector today. I'm your host, TD Cowen biotech and life science tools and diagnostics analyst, Brendan Smith. And today I'm joined by none other than Recursion's Chief R&D Officer and Chief Commercial Officer, Dr. Najat Khan.
Najat, it's great to have you and welcome.
Najat Khan, Ph.D.:
Thank you. It's great to be here today.
Brendan Smith:
So as you all know, this podcast series is specifically designed to break down this massive cross sector, almost all-consuming revolution that is artificial intelligence into individual points, one at a time, highlight the biggest misconceptions, and then recontextualize each piece back into the healthcare macrocosm. A tall order admittedly, but I think a worthy endeavor nonetheless.
Today, Dr. Khan and I are looking to pull back the curtain really on one of the biggest data sets in AI drug discovery, and discuss how the training data you have, the data you generate, and the data you use play such a critical role in the world of AI-powered drug discovery.
Najat, we, as I'm sure you do, continuously get questions from investors, academics, corporates, about how to distinguish different AI platforms, but maybe more importantly for today, what makes a quality data set and how much training data really is needed. Maybe let's just dive right in. I know you've spent much of your professional and academic career centered in this computational side of biology and chemistry, really landing you pretty squarely in the middle of what's now being dubbed the tech bio industry. Maybe just first help us get oriented, the 1000-foot view. What are the most important considerations when building a new AI/ML model and platform?
Najat Khan, Ph.D.:
No, that's a great, great question. There are so many companies and so many platforms that it becomes important to look under the hood. I think step one, the first consideration is going to be something that probably feels pretty non-obvious, but it's around what questions are you trying to answer? What are the core elements you are trying to enhance or improve? If you take drug discovery, what are the two, three things that really bend the success rate in the wrong direction? One is do we really understand holistically the biology that's driving the disease? The second one is around the design of the molecule and what are potential off-target effects, we're going to talk about that, predicting toxicity and so forth. And then the third is also just once you get into clinical trials, being able to simulate trials and who's going to be a better responder from a patient perspective.
Once you know what the core questions are, then the next question is what are the right data sets that help you answer those questions? You asked something around quantity. I also want to bring in the focus on quality. Quality and quantity are both important in the sense that look, quantity, if you don't have a minimum viable amount of data sets, of course statistically things are not going to be powered. But on the quality side, I would say so much of the time the data scientists spend on, and a lot of our engineers, 50 to 60% if not more, is around ensuring the data is clean, not redundant, reducing error rate and so forth, right? Because you don't want to reduce the signal-to-noise. When I'm looking at companies, I'm focusing on how much of that data, the quality and the quantity is directed to something fit for purpose so you can actually answer the core questions that are needed.
Then the last thing I'll say is don't want to forget about the algorithms. They're very important, but garbage in, garbage out. The data sets have to be critically important. Often I'll get the question that when you look at other industries, like look at ChatGPT, it doesn't take that long. Why does it take so long to build a platform in the AI tech bio space? I'll say, "Look, most of the corpus of data that's needed, a lot of it is actually available for what ChatGPT and other LLM models have been trained on. But if you look at biology, if you look at chemistry for biology, we probably understand 10 to 15% of biology." There's so many gaps that unless companies like Recursion and others are investing to really generate those fit for purpose data sets focused on quality and quantity, that's what you can then use to map out the entire human biology map. Not just one protein that causes an issue, but the whole host of the environment and the pathways and the connections that give us a better understanding of not just the biology but potential off-target effects down the road.
Brendan Smith:
Yeah, so it really seems it's not so much always the biggest data set is going to give you the best model. I think maybe amongst a lot of the investment community, there's a running theory for at least a while that perhaps bigger is just inevitably going to get you there faster. But where we are today, maybe it's not necessarily always going to be the case. I guess maybe this is a great lead into my next question really. I guess what does, from where you stand, the investment community maybe underappreciate or misunderstand altogether about what's already been accomplished to date in this respect?
Najat Khan, Ph.D.:
That's a great question. I'll go through both discovery and development. I think from a discovery perspective, if you look at our understanding of biology, that is increasing. There's so many novel targets that are being identified, leveraging computational approaches including AI and deep learning. The other piece is around design of molecules. There's a lot of molecules that are leveraging computational tools. I'll bifurcate it in two fronts. One is biologics. Biologics, one of the data sets that's used, the protein database, that's a couple of hundred thousand of really beautifully annotated structures that has really helped propel the design of a lot of biologics. For small molecules, some of these data sets are being generated today. But for instance, at Recursion, we have several programs in the clinic, over 20 in discovery partnered and non-partnered, internal. A lot of these molecules are being designed in 18 months, 15 months, a fraction of the time that it actually takes start to finish, usually in industry it's 42 months. You're synthesizing thousands, tens of thousands of compounds versus we do 200, 300. You're starting to see I call them the green shoots of success.
I don't want to overstate it. At the end of the day, things need to work in the clinic and our success rate in the industry, it's 10%. You got to have multiple shots on goal and then to be able to understand how it's doing. But the early signs before an inflection curve is always can you design better, faster? Can you get to the right answer in a quicker way? I think that's where you're seeing both on the chemistry side as well as the biology side, these green shoots of success.
If I can just say also on the clinical development side, folks are starting to use AI both on the precision medicine side and then also on clinical trial recruitment. Trials take a very long time to recruit. 80% of trials don't recruit on time. Precision medicine, if you don't pick the right patients, you dilute your signal-to-noise. I think there are other examples too we can talk about where we're starting to see those early signs of success.
Brendan Smith:
Yeah, I think that's such an important point here. In years past, I found so many people who are even really wanting to get more in the weeds on this AI drug discovery side of the conversation. You inevitably have a cohort of investors who kind of say, "Well, maybe the phase 2 data is years away, phase 3 data even farther away. I don't feel like waiting for that. What else can I point to?" For a while, there's not been a ton of publicly available information at least for us to be able to compare some of this. I give your team a lot of credit for looking at how long have some of these development programs taken, how many have we been able to develop, how much cost and time savings, putting actual numbers behind some of this, I think increasingly is becoming more and more of a currency when we talk about this too.
I think this is a really helpful, again, segue into the next thing I wanted to address really is the question I get frankly all the time, I'm sure you do too. But how can someone on the outside of these companies, we don't have access to all the data sets that you have, all the models, algorithms. I'm not a trained computer scientist by any means so even if I did, I probably wouldn't even know what it meant. But whether it's an analyst, investor, my neighbor who's been watching a ton of sci-fi and really interested in AI all of a sudden. How can the rest of us work to understand the value or quality of a company's data set? Maybe that question is better phrased what can we ask you or your counterparts to better appreciate some of these differences between the sets and platforms?
Najat Khan, Ph.D.:
Yeah, there's a few things. First, I would start by saying what is the differentiation that this platform has? What I'm always listening for is what is the problem, those three problems I mentioned whether it's biology, chemistry, picking the right patients, precision medicine, what are you solving for? Next thing is show me how the data sets, any examples are actually being able to solve that and compare that. I like to have benchmarks. Then the third thing is I also try to understand the data sets. I want to look under the hood and understand how much of that is truly proprietary, quality, quantity, don't just give me quantity.
Look, the last thing I'll say is looking at the management teams and understanding their experience is extremely important. Drug discovery and development is one of the most humbling things. Like you said, it's a very noble endeavor, but it's very hard. It's very painful. 10% success rate means most things don't work. What I'm looking for is how many battle scars and wounds do you have? Do you actually know the reps that it takes to know what good looks like? I think that's very important in this tech bio space where you're merging both that drug hunter expertise who is not too conservative and is actually open-minded to leveraging AI machine learning. That's sort of bilingual mentality. The team is critical.
The last thing I'll say is also you can have the best data, the best algorithms, the best questions you want to answer. How are you integrating? It's when and how are you leveraging that across to make a better drug? Just having an algorithm that's great at predicting protein binding, that's great, but is it actually going after the right protein that is the driver of the disease? Is it something that's going to circulate in the body long enough to have its effect and not just processed by the liver in 30 minutes? That's an issue. Do you know what I mean? There are a lot of practical considerations. I'm looking for people and ideas and opportunities that are bold, but pragmatic at the same time.
Brendan Smith:
Yeah, and I think you addressed this fit for purpose.
Najat Khan, Ph.D.:
Yeah.
Brendan Smith:
I think it's increasingly just a phrase that we're hearing more and more because it's like what are you trying to actually accomplish? Not at a broad existential level, but ultimately a lot of these models are designed to answer some specific questions. What are your questions?
Najat Khan, Ph.D.:
Yep.
Brendan Smith:
How effective are you getting to that answer? How good is the answer?
Najat Khan, Ph.D.:
Yeah.
Brendan Smith:
Very, very kind of brought down to size.
Najat Khan, Ph.D.:
I'll give you an example. Recursion started with being super patient-centric by doing phenotypic screening, which is using patient cell lines or primary cells, whatever it might be, and knocking out every gene, all 20,000 genes, and creating a heat map that basically gives you gene-gene interactions. You add compounds on top of that, it gives you gene-compound interactions. Why is that important? You get a whole map of the pathways, the protein clusters, and so forth. Great. But then you also get a sense of what is that initial chemical substrate that might actually do something to modulate. Why am I mentioning that? Why is that fit for purpose? It's fit for purpose because you're not just looking at just one class of proteins and have a selection bias there. You're looking across the entire genome. You can do overexpression, knockout, and so forth.
I think that's important too. I want to understand what the edge cases are. If you're too focused in a certain area, you don't see the whole distribution and then you might miss something. That as an example, is fit for purpose data. Same thing in clinical development. You want to have patient data, great, you have phenomic data, but I want to see transcriptomic. I want to see clinical outcomes of patients. I want to have genetic data. Once you start building that stack, as we call it, eventually doing a virtual cell or a virtual person, that becomes critical to understand the connectivity across different layers and how does something from a biological perspective translate to clinical outcomes, which is ultimately what we're doing here. Because we're trying to make better medicines for patients.
Brendan Smith:
A virtual cell and a virtual person, that's definitely something we'll all be watching for. I think you're also getting at what's increasingly a question that is often circulated here. Is there a world where companies limited to public data sets are able to compete with proprietary data platforms?
Najat Khan, Ph.D.:
Yeah, it's a good question. I think given the limited amount of fit for purpose, high-quality, and high-volume data that exists in the biotech or life sciences space, let me just broaden it, that generating data is going to be important.
However, here's what's happening. When I look at the world, I see a tale of two cities. There are some companies that are generating data, but they're not necessarily always asking the right questions. It's data for the sake of data. Let's build a platform and they will come. That is not the right approach. It has to be for a purpose. Then there's another set of companies that are not generating data, but they're actually using it in very ingenious ways.
What you want to do is have the best of both. You want to have proprietary data where that matters, partner with other data sets that exist. For instance, we have proprietary data in the biology chemistry side, like I mentioned. But we also partner with companies, whether it's Tempus, Helix, and also in clinical development. You're being opportunistic and smart, best of both worlds. Then the question is, are you driving for the right questions? Are you driving towards the right questions? Are you actually integrating it across the entire stack from target ID or biology hypothesis, chemistry, clinical development, clinical execution? That's what you need to do.
Brendan Smith:
Yeah, and I think that's a largely underappreciated aspect of the depth of this entire process that's really required here. Again, I think this is also getting another side of this conversation that's really when you're looking at areas with limited clinical or bioinformatics data to train on something like orphan diseases, for example. You're talking about some of these clinical training sets. How do you get around some of the challenges of building a model and hopefully developing a drug down the road with either lesser-known disease pathology, not as great access to the patient population itself? What are some of those considerations that go into building those models?
Najat Khan, Ph.D.:
Yeah, that's a great question. I'll start with the biology, understanding what's driving the disease, because there's hundreds of rare diseases, orphan diseases. But if it's a monogenic one, I think from a biology perspective, you can do the knockout phenotypic screening, many other approaches. It's doable. I think when it's polygenic or just complex for multiple reasons, then multimodal data becomes critical to basically stratify patients. Because it's not just genetic. You're looking at transcriptomic, proteomic. You're also looking at clinical outcome data.
The second thing I'll say is look, in terms of designing a molecule for it, this is why generative AI and active learning is so important, which is something of course Recursion has from what Recursion is built but then also the acquisition, or I should say the combination with Exscientia. We can talk a little bit more about that.
But once you get to the clinic, there are two things I want to point out. A lot of these diseases, it's hard to find patients, as you mentioned, it's because they're not diagnosed on time. They get misdiagnosed. Most rare diseases, you get misdiagnosed five, six times. One of the things I've done in my prior life, instead of trying to use real world data and understanding which patients might have a disease, you can actually use, for instance, for PH, which is a rare disease, ECGs, and you can pick up the disease from the ECG scans. What ends up happening is at the primary care stage, you can actually pick up the patients much, much faster because a lot of the doctors aren't thinking about those rare diseases. But you can actually use data and deep learning algorithms because they're so much more sensitive in terms of picking up the signal. Just like we do with computer vision layered on top of [inaudible 00:16:40] images.
The other thing I'll also say is recruiting patients in rare diseases is one of the hardest things to do. This is something for me, I'm personally building that out as well here, which is you're using real world data, clinical trial data, and then on top of that, adding AI machine learning to create a hotspot of where is the patient population that's most relevant for your study? Of course anonymized, but which sites they're at. We talk about being patient centric.
We started from the get-go using cell lines and not just animal models. We use that to augment. It's an and, but that's not where we start. We're also exploring approaches with organoids, predictive tox. We'll talk a little bit more about that with some of the new FDA guidance, which is great to see. Then all the way through the clinic. You're going to where the patients are versus what we do today is we go to sites that we know of versus let's not just use our knowledge, let's use the world's knowledge. Let's use data to tell us where the patients are and to go to them to accelerate recruitment.
Brendan Smith:
Yeah, that's great to hear. You referenced also the merger between Recursion and Exscientia.
Najat Khan, Ph.D.:
Yeah.
Brendan Smith:
Maybe let's talk a little bit about that just for a minute. I think earlier this year you merged into basically a single AI powerhouse, which I think fits well in this conversation for a couple of different reasons. But maybe without even diving into any of the financial logistics of it. But really, what was it about the technological marriage of the platforms that makes the Recursion of 2025 different than in years past?
Najat Khan, Ph.D.:
Yeah, look, I think it's so complementary. When you're building something like this, a tech bio or a biotech powerhouse, you want to have the end-to-end suite. Recursion really started on the biology front, pioneering this aspect of phenotypic drug discovery, and then adding multimodal layers to that. Exscientia really started on the chemistry side. For [inaudible 00:18:29] ID, lead optimizations. It's really impressive to see.
What I love about the two platforms coming together is it's very modular. As the technology progresses, which it is very fast, we can be opportunistic about the best open-source model we can bring in or the models that we have. Again, it goes back to the principle of integration. You want to have the best AI algorithms and data sets on every step of the core question that really upends the success rate in drug discovery and development. That's one.
The second thing I'll also say is, look, it gives us an opportunity to really have more optionality in our portfolio. Complementarity in the platform, complementarity in the portfolio, oncology and rare disease primarily. Then I would say also bringing the best talent in the space, getting really you're saying about your neighbor, et cetera, I get so many questions. Also in my prior life where I built an AI team from scratch in a large company, getting folks that understand computational approaches and AI and machine learning, et cetera, but also chemistry, biology, and clinical development. Not just that, but they almost know two languages and enough to appreciate each other wherever your native start is. That is one of the hardest things to do. One of the hardest things to do. That was an amazing opportunity to bring the best of the best together.
Brendan Smith:
Yeah, I think earlier you called it a bilingual comprehension.
Najat Khan, Ph.D.:
Yes.
Brendan Smith:
I love that expression too. I think it's so fitting for really the state of where this is today.
I think I would be a little remiss if I didn't address one of the other hot topics of the day that you also referenced earlier, which is some new guidance out of FDA that just came out. To put it bluntly, FDA is looking to phase out animal testing requirements for monoclonal antibodies and maybe some other undisclosed drug modalities (it's a little unclear based on the press release, at least at this point), in lieu of what they call new approach methodologies, or NAMs. Really just a complete paradigm shift out of the agency. I think specifically, I can actually just read a quote here that they have by [inaudible 00:20:33], "By leveraging AI-based computational modeling, human organoid model-based lab testing, and real-world human data, we can get safer treatments to patients faster and more reliably, while also reducing R&D costs and drug prices. It's a win-win for public health and ethics." It's from the FDA press release that's online.
Maybe let me just first ask you, number one, how transformative is this really for drug developers. Then secondly, what does this mean for Recursion?
Najat Khan, Ph.D.:
Yeah. Just big picture, this is absolutely the right direction we should be going. I love in the quote, and it resonates a lot with me scientifically and ethically. Ethically, both for patients and for animals. I'm an animal lover, so I can't help but underscore that.
Look, at the end of the day, you want to, I'm stating the obvious here, but as a drug hunter, you want to start testing in the models that are most relevant for the end user at the end of the day. That is patients. This is what I was saying. For Recursion, we actually start with cell lines, primary cell lines and other cell lines. That's really important because we want to understand right off the gate what's happening potentially down the road for patients.
Look, this has been a long time coming. It's a step in the absolute right direction. Biologics as a start. I'm sure there'll be other modalities going forward, but a lot of us in the industry, I have to say, have not been sitting still. Whether it's Recursion, but also other companies, the use of organoids, and there's a lot of work happening in the use of organoids to try to really test out not just in cell lines, but a translational model that's maybe more relevant for what we need. Relevance is really important here.
Then also being able to better understand patient stratification. We didn't talk a lot about that, but patient stratification is really, really critical when you're doing drug discovery and development. Look, for me, I'm in this because we want to make better medicines, but one of the goals that would be amazing is when we understand or we think we understand what's the protein that's driving the disease early, early in discovery, I would love at that point in time to understand which patient will respond versus not. The only way you can do that is to just have better representative models. I think that's where there's still a lot of work to be done. Organoid models is still technological work to get it to be up to par to what we need.
The other thing I'll also say is predictive tox. There's a lot in that. Then predictive PK and all of this ADMET work. This is sometimes the uncool step that people don't talk about, but so, so practical and relevant. Very important for biologics, but for small molecules, 40% of drugs don't work out because of tox that you didn't predict. It's even more important for small molecules. At Recursion, through the acquisition of Cyclica, Valence, and then also now combining with Exscientia, a huge amount of focus is on predictive tox models. Can we predict tox early in advance? Can we predict absorption, distribution, and so forth? There's a whole cluster of suite of tools.
Listen, is anything perfect always? No. But it's this recursive learning that we have that we're constantly testing, learning, and creating those proprietary data sets that we have. Automation here is very, very important because it allows you to do it at scale with high quality, fit for purpose data sets. I think Recursion is poised. You have the data, you have the compute, you have the algorithms, and enough shots on goal to actually be able to test it out, discovery and development. But look, overall, this is a step in the right direction. I can't wait to see it progress further across more modalities. A lot of work to be done, but it's good to see.
Brendan Smith:
Yeah, I think it really is something almost unprecedented out of FDA these days, and that's not a word that we can throw around lightly in daily conversations in 2025. But here we are. I'm sure it's something we'll continue to discuss both on this podcast moving forward, but also just in conversations with different constituencies across this entire sector, frankly.
Look, we've covered some great ground today. I would say there's an endless supply of topics that we could dive into and could be here forever. But before I let you go, I do want to just ask if everything we've discussed, all the granularity today goes over someone's head, but they've made it with us this far, what is one point you would really want everyone listening in to remember and take away from our conversation today?
Najat Khan, Ph.D.:
I would say we all know that the success rate is too low, 10%, 10 years. The cost is so high. One thing I'd want everyone to take away is it's not an option not to try. There's going to be ups and downs, anytime you blend new disciplines together. Does a model work perfectly the first time? Does a platform work perfectly the first time? Especially when we have so much gap in data in the space, fit-for-purpose data, all of that. But we need to stick with it. Just like the FDA guidance, which is great to see, it's been in the works for over a decade, but it's great. If somebody stopped trying then, we wouldn't be at this moment where we're really shifting course. A lot of us have been waiting for that and working on it already. That's what I would say.
Sometimes I get a lot of questions like, "Why should we?" We can't afford to do things the way we've always been doing it for all the reasons I mentioned. I love Teddy Roosevelt's quote, which is like, "The man," or I will say, "The person in the arena." Instead of watching and waiting and not trying, let's just roll up our sleeves and figure out how to make this work. That's what I'm focused on, more than will it work or not. It's not an if it will work, it's how and who. Therefore it's going to be those that really dive into it that can make a difference.
Last thing I'll say, please, please, please make sure you learn about AI. There's a lot of people like your neighbor that will ask me, and I always point them to different courses and so forth. But especially with ChatGPT, it's so easy to learn things these days and other forums. Please take it on yourself, lean in, learn, and let's give it our best shot.
Brendan Smith:
As good of a call to action as I can possibly conceive of. Thank you so much for joining us today, Najat. It's been an absolute pleasure and I look forward to continuing the conversation moving forward.
Najat Khan, Ph.D.:
Thank you.
This podcast should not be copied, distributed, published or reproduced, in whole or in part. The information contained in this recording was obtained from publicly available sources, has not been independently verified by TD Securities, may not be current, and TD Securities has no obligation to provide any updates or changes. All price references and market forecasts are as of the date of recording. The views and opinions expressed in this podcast are not necessarily those of TD Securities and may differ from the views and opinions of other departments or divisions of TD Securities and its affiliates. TD Securities is not providing any financial, economic, legal, accounting, or tax advice or recommendations in this podcast. The information contained in this podcast does not constitute investment advice or an offer to buy or sell securities or any other product and should not be relied upon to evaluate any potential transaction. Neither TD Securities nor any of its affiliates makes any representation or warranty, express or implied, as to the accuracy or completeness of the statements or any information contained in this podcast and any liability therefore (including in respect of direct, indirect or consequential loss or damage) is expressly disclaimed.

Brendan Smith
Director, Life Science & Diagnostic Tools and Biotech Analyst, TD Cowen
Brendan Smith joined TD Cowen in 2019 and covers life science & diagnostic tools and biotech. He holds an MA, MPhil, and Ph.D. from Columbia.