The Case for Data Sharing with The Broad Institute
Guest: Dr. Shantanu Singh, Senior Group Leader, The Broad Institute
Host: Brendan Smith, Director, Life Science & Diagnostic Tools and Biotech Analyst, TD Cowen
In this episode, TD Cowen's Health Care Analyst Brendan Smith hosts Dr. Shantanu Singh, Senior Group Leader at the Broad Institute, to explore the emerging use cases for AI in basic science research and the recent increase in data sharing initiatives across the sector. We discuss how researchers are countering key bottlenecks with novel data sharing techniques and what this paradigm shift could mean for the evolution of health care AI and the industry as a whole.
This podcast was originally recorded on March 4, 2026.
Speaker 1:
Welcome to TD Cowen Insights, a space that brings leading thinkers together to share insights and ideas shaping the world around us. Join us as we converse with the top minds who are influencing our global sectors.
Brendan Smith:
All right. We are here live at the 46th annual TD Cowen Healthcare Conference. I'm your host and TD Cowen Healthcare Analyst, Brendan Smith, and welcoming you back to another episode of Machine Medicine, AI and Healthcare. This is TD Cowen's podcast series where we bring you the latest and most important takeaways from the state of AI and the healthcare sector today.
Today, I'm joined by the Broad Institute's Dr. Shantanu Singh. Dr. Singh, it's great to have you and welcome.
Dr. Shantanu Singh:
Thank you.
Brendan Smith:
So for anyone new to our podcast series, Machine Medicine aims to break down the use of artificial intelligence across healthcare into bite-sized digestible points one episode at a time, highlight the biggest misconceptions, and then recontextualize each piece back into the bigger picture. And today, Dr. Singh and I are exploring how the collective pool of AI researchers within healthcare is countering its key data bottlenecks within the evolving research landscape with next-gen tools and some novel data sharing technologies, and realistically what this paradigm shift could mean for the life sciences industry as a whole.
So Dr. Singh, let's just dive right in. Maybe just start, could you walk us through how you're using AI and data powered tools in your day-to-day workflow and really how the advancement and integration of these technologies has transformed your area of expertise?
Dr. Shantanu Singh:
Yeah. So our work is primarily focused on using images as a way of reading out the state of a biological system, typically cells or sometimes tissue. And the field that has enabled that, the technological field that's enabled that, is computer vision, the ability of computers to look at images and identify structures in them and extract measurements from them.
So for the longest time, the field was all about the technologies needed for that: computer algorithms, image processing algorithms that were used to go and identify cells in an image and do all those things, extract measurements and then read out the state of the cell using those measurements.
And then, there are methods downstream of that, the data analysis and machine learning methods that are used to take all those data and address problems that allow you to find new drugs, in a sense. And there's a whole research program, and we'll talk about it, about how these readouts can be used to study biology.
But there's been a revolution over the last 10 years on both those fronts. Our ability now to, for example, given an image, identify cells automatically without a human needing to tune those algorithms has just become dramatically better. And then downstream of that, using machine learning methods that allow us to do a better job of making those predictions.
So in some sense, AI in that form, it's fundamental to our science. It's not an add-on. So there's been that transformation there because of these new methods. But then also separately, there are agentic coding tools. For example, Claude Code, some of you may have heard about it, they're already integrated into our daily workflow. So they're writing and maintaining analysis pipelines, they're generating documentation, building infrastructure. And I think in that sense, over the last year really, that's been the biggest productivity lift so far.
Brendan Smith:
Yeah. And I think you're kind of touching on a lot of important points that frankly we're asked about all the time, but it really kind of in the realm of benchmarking. So when it actually comes to benchmarking some of these improvements, where do you see some of the most meaningful ROI within your own research projects? And are there really any tools that you're building or leveraging that you feel have pretty outsized potential for disrupting how you and your colleagues operate across the board?
Dr. Shantanu Singh:
Sure. Yeah. So in some sense, our field itself is the disruption. Using images as a readout of cell state, it gives us a very cheap, fast, and easy way of reading out what any biological perturbation, a chemical perturbation or a genetic change, could be doing to a cell.
So there's these high-dimensional readouts, the high-dimensional measurements of cell state. You can think of them as a fingerprint or a signature of cells. The way of generating that has just become so inexpensive. And that is a fundamentally different way of doing drug discovery and biology as a result. Essentially, what we do is we take pictures of cells that have been perturbed or treated with chemicals. We treat cells with chemicals, or knock out a gene, or something like that, that we want to study.
And then we stain them with different colored dyes, fluorescent dyes that glow when you shoot lasers at them. And then we take a lot of pictures of those cells, and those pictures then become the data that we use.
So that in some sense is transformative because it's really, really cheap to do that versus doing a lot of the other more complicated, sophisticated ways of extracting measurements from them. And the impact of that has been manifold. We, for example, can use that to identify, to predict the mechanism of action of a drug: what is the drug potentially doing in a system? Or doing virtual screening. That is, you have these really expensive assays that are run in pharma to predict the activity of a compound. Those assays can be potentially not replaced, but at least virtually predicted, by taking images as the input and taking a lot of training data as the output from existing legacy data sets that they may have, and then just coming up with predictive models that might be able to virtually replace those assays.
So in some senses, through all this, we have given biology a cheap and universal readout, and that sort of really changes the economics of everything downstream of that.
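As a rough illustration of the mechanism-of-action prediction described above, the sketch below treats each compound's image-derived readout as a numeric fingerprint and assigns a query compound the annotation of its most similar reference profile. This is a minimal nearest-neighbor sketch, not the lab's actual pipeline: the compound names, the four-number profiles, the mechanism labels, and the function names are all hypothetical, and real profiles typically have hundreds or thousands of features.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length profiles (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_mechanism(query_profile, reference_profiles):
    """Return (name, mechanism) of the reference compound most similar to the query."""
    best_name = max(
        reference_profiles,
        key=lambda name: cosine_similarity(query_profile, reference_profiles[name][0]),
    )
    return best_name, reference_profiles[best_name][1]

# Hypothetical toy data: each entry is (profile, annotated mechanism of action).
REFERENCE_PROFILES = {
    "compound_A": ([0.9, 0.1, -0.3, 0.0], "tubulin inhibitor"),
    "compound_B": ([-0.2, 0.8, 0.5, 0.1], "HDAC inhibitor"),
}

# An unannotated compound whose profile resembles compound_A's.
query = [0.8, 0.2, -0.2, 0.1]
print(predict_mechanism(query, REFERENCE_PROFILES))
```

The same idea extends to virtual screening: instead of copying a nearest neighbor's label, a model is trained with image profiles as input and legacy assay results as the target.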
Brendan Smith:
And I mean, this kind of naturally lends itself to a lot of questions about the interplay between different researchers across the sector. So we've seen, recently, a lot of blurring of the lines around some of the traditional data silos between big pharma and some of these more AI-centric platform providers. So I guess, from where you sit, how has the industry-wide shift towards AI ML impacted the relationship between smaller labs, platform providers, and pharma companies? And maybe how do you expect this to continue to evolve as these technologies also advance?
Dr. Shantanu Singh:
So pharma's moat, as I say, it hasn't disappeared, but it's certainly shifted. So the real irreplaceable asset is the clinical outcome data from phase one to phase three, all those trial results, linked back to molecular data and phenotypic profiles. And no amount of the kind of data that we generate, what we call Cell Painting data, or AlphaFold data, prediction of protein structure or binding, can give you that.
So preclinical data is really where the moat has eroded the most, and I'll talk about that. But the clinical data is where I think it's barely budged. But the cost of producing these large-scale preclinical data has dropped dramatically. So methods from our lab, like Cell Painting, but many other such technologies, allow even smaller labs to generate massive amounts of data cheaply. You can now relatively cheaply profile compound collections of thousands or tens of thousands, with the reasonable budget that an academic lab can afford. And those give you very rich readouts that allow you to make a lot of predictions downstream.
In silico approaches, for example, AlphaFold 3 and what is being called AlphaFold 4, Isomorphic Labs' structure prediction thingy, which seems amazing, they sort of let you scale computationally without wet lab infrastructure. Companies like Recursion have built proprietary imaging datasets that rival pharma in scale for this classic modality.
And I'd also say that public data sort of disproportionately helps smaller players. The kind of datasets we've generated, 700 terabytes of imaging data that has been put out there, it lets smaller companies and academic labs hone methods, benchmark them, and compete. Pharma already had a lot of internal data. So for resource-constrained groups, this is really a game changer in that sense.
And I think even within the preclinical moat there's variation. I think where it's the thinnest is in the single-modality preclinical data, like if you just take imaging alone, or transcriptional profiling alone, where you're reading out states of gene expression, I think that is where it's the thinnest. But where it's the thickest is multimodal data, where you take imaging data and other omics data and clinical data. That's where pharma's breadth across the data types is really hard to match. So in a sense, I mean, just to zoom back out, public data, they haven't leveled the playing field. I think they've actually built the playing field. It allows small companies to get into the game, and they couldn't have gotten into the game before.
Brendan Smith:
Yeah. I mean, this conversation about playing fields and different relative contributions of some of the data stacks too, I think is notable just given the uptick in data sharing initiatives really between pharma, biotech, even academic research institutions across these kind of different AI platforms as well.
So I guess kind of related to that last question, I'm curious, how has the shift towards some of these more advanced data sharing techniques, including federated learning, which I know we've spoken about previously, impacted the kinds of work you and your colleagues are able to do? And maybe how does that differ, realistically, from some of the more common open sourcing efforts tied to traditional academia?
Dr. Shantanu Singh:
So there's sort of two complementary models, and we do one, but I can speak about both. So our model is structured pre-competitive consortia. So for example, the JUMP Cell Painting Consortium, focused on creating the world's largest Cell Painting dataset, was a consortium of 10 pharma companies that got together to create a compound collection of about 120,000 compounds and about three-fourths of the human genome either knocked out or overexpressed or perturbed in some way.
And in that we had these 10 pharma companies together agreeing on a set of compounds they're going to profile, shipping compounds to each other, and then producing data at each of their centers, and then sharing all that data publicly a year later. So there's that.
There's the OASIS Consortium, a consortium that's been created to be able to predict liver toxicity without animal testing, or rather be able to reduce, potentially eventually replace, animal testing for drug discovery. That is a consortium right now of more than 50 institutions that have gotten together and we sort of collaborate on coming up with experimental design. There's a smaller number of data producing centers, but essentially we are doing that against a shared compound set, and that sort of required years of trust building, novel governance methods and IP frameworks and things like that.
So that's the approach we have taken, right? Structured pre-competitive consortia producing data that is eventually made public. But then there's federated learning, as you said. In that case, the model travels to each organization's data. Only the learned parameters are transferred back, are shared back, and raw data never really leaves. We've talked about that before. Eli Lilly's TuneLab, there are I think about 16 models or something like that, that were built on what they estimate to be $1 billion worth of internal research, of internal data.
And biotech partners like Insitro and Circle Pharma and others, they improve the models without Lilly ever seeing those compounds. So that is a different model. And I think both serve different needs. There's pre-competitive for public resources and shared benchmarks, and federated for proprietary libraries where companies can't share raw data. And they're sort of complements in that sense, not really competitors.
So the concrete impact of our work, of course, if we just talk about the first one, the pre-competitive consortia, is that it's allowed the companies themselves to understand how these data are produced, those who don't have that expertise, but then also other labs to now tap into that, because this is all released under a CC0 license. So anyone can use it, literally for anything.
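The federated model described in this answer, where the model travels to each organization and only learned parameters come back, can be sketched with a toy version of federated averaging. This is a minimal illustration under stated assumptions, not how TuneLab or any specific platform is implemented: the two "sites", their data points, the one-feature linear model, and all function names are hypothetical.

```python
def local_update(weights, data, lr=0.5, epochs=20):
    """Train on one site's private data; return only the parameters and sample count."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y  # prediction error for the linear model y = w*x + b
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return (w, b), n

def federated_average(updates):
    """Combine site parameters, weighting each site by how many samples it trained on."""
    total = sum(n for _, n in updates)
    w = sum(params[0] * n for params, n in updates) / total
    b = sum(params[1] * n for params, n in updates) / total
    return (w, b)

def run_federated(sites, rounds=50):
    """Coordinator loop: send the global model out, average what comes back."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Raw (x, y) pairs stay inside local_update; only parameters are returned.
        updates = [local_update(global_weights, data) for data in sites]
        global_weights = federated_average(updates)
    return global_weights

# Hypothetical private datasets at two sites, both drawn from y = 2x + 1.
SITES = [
    [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)],
    [(0.25, 1.5), (0.75, 2.5)],
]

print(run_federated(SITES))
```

Production systems add layers this sketch omits, such as secure aggregation and safeguards against parameters leaking information about the underlying data.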
Brendan Smith:
We're already kind of touching on some of this, and I can suspect what your answer will be realistically, but I guess just at a high level, what do you see as some of the biggest incentives, I guess, for both parties to enter into these kinds of data sharing collaborations? And I guess maybe better put, what's the biggest draw for you and your colleagues relative to how others in the industry would be viewing this approach, just given the historical cadence towards competition?
Dr. Shantanu Singh:
That's really a personal journey for many of us, having done that over the last many years. The way I see it, and this may differ from individual to individual, I think for me, the data itself isn't the most valuable part. It's the process of creating it together. So just designing the assay, figuring out quality control, processing the analysis, that shared learning I think is really what accelerates the field.
And like I said, there were these two consortium models. Both of them were community driven. The JUMP Consortium really had all the data generating centers, all the companies generating data together across a complementary set of compounds. And so that sort of forces everyone to align on the protocols, learn from each other's mistakes, and standardize quality. And the collective knowledge building from that was the real output.
Somewhat similarly in the OASIS Consortium, where we're trying to predict liver toxicity just from looking at cells and studying them, that forces... In that case, we have partners that provide input on design, they collaborate on the analysis, they publish together, and it's still deeply collaborative, just that there are fewer data generating centers.
So I think in both cases, the incentive really is the community, hearing how others approach the problem, learning together, and raising everyone's quality bar. So in that sense, I think the data is really the artifact, but the shared expertise there is really the lasting value.
Brendan Smith:
Yeah. And I think, look, realistically, we're all coming at this conversation, different backgrounds, different priorities, different angles, different experiences, frankly. And I think this is a topic that we perpetually need to revisit, but I'm kind of curious in the context of what we discussed here, and given your experience across these different consortia and across your different research colleague groups, what do you feel from your viewpoint, maybe the investment community and folks like myself, frankly, anyone who is not actually using a lot of these tools every day, likely underappreciate, or altogether kind of misunderstand, about AI within your work that you think is especially important for, really any investor to understand. And here we are in March 2026.
Dr. Shantanu Singh:
I'd love to spread this message, the fact that images are still an underestimated source of data, underestimated as a data modality. So for the longest time, cell images, they look beautiful, but they weren't really considered computable in the way genomic data are. And you look at an image, you can recognize cells, but the idea that you can then group compounds based on shared mechanism, or be able to predict the gene pathway membership, just based on looking at those cell images alone, that was not obvious.
And we spent, and several of our colleagues have spent almost more than a decade building that evidence, and I don't think it's sunk in yet. The other thing is that, I bet your colleagues have been hearing a lot about foundation models in biology. And I think fundamentally foundation models in biology aren't where language models are, and they're not even close right now.
The story is different for protein language models, I would say. There's just a lot of protein data out there, and they're sort of much more amenable to the same type of model training as language models are. So that's a slightly different story.
But if you look at biological foundation models for practically all the other data types, it's not even close. So I think there's definitely a lot of marketing and there's a lot of need for building that up, but it is not there yet. And the reality on the ground is that companies have to work with custom bespoke models on the data they have collected, tuned to the specific problems they need to solve. They're still foundational for that company, but they're not foundational in the way Claude Opus is, for example.
And then I would say that the boring stuff matters most. And I suspect investors would get this, but I think they might still find it surprising, the fact that careful data curation, protocol standardization, quality control, if you want to control all the problems that are happening upstream, can boost signal early, and you can sort of dramatically reduce how much machine learning you might need later downstream to fix all that stuff.
And then, of course, this is probably also obvious too, the fact that data production is a bottleneck, not AI. So producing structured, reproducible, biological data at scale, is harder and more expensive than training models on it. Investors, I think, tend to focus a lot on the models, but the real constraint is the data, and I think there's a growing appreciation for it.
So in some senses, I guess the surprising, unappreciated thing is just that cell images, they've been hiding in plain sight as far as our work is concerned. We have spent a decade proving they carry as much biology as genomic information at a fraction of the cost.
Brendan Smith:
Yeah. I mean, that's fantastic. I think it's a lot of important bits to kind of chew on there. So, I know we've covered a lot of great ground today, and it's a conversation I'm sure we'll continue to have together and really across the industry for the foreseeable future.
But before I let you go, one thing I like to ask all of our guests is, if everything we've discussed today goes over someone's head, whether by an inch or by a mile, but they've made it with us this far, what is the one point you would really want everyone listening in to remember and take away from our conversation today?
Dr. Shantanu Singh:
If you don't mind, I'll make it two, but sort of joined, which is that the rate at which we can produce massive biological data has gone up dramatically. And the nature of the data we deal with is actually fairly easy to understand in the sense that they're really pictures of cells under a microscope painted with dyes that stick to major structures and that's it.
And the other thing is that when you make it cheap and you make it simple and you describe it well, any lab can do it. And that really democratizes science. And when everyone is using the same method, you get a common language, and it's a shared way of probing biology together. And that's really what our lab has been pioneering. We gave biology a common language that's cheap enough for any lab, and it's rich enough for AI.
Brendan Smith:
All right. And with that, I want to thank you very much for hopping on and talking us through what really is kind of the cutting edge of this marriage between software and healthcare tech innovation. I'm sure we'll have plenty more to discuss over the weeks and months ahead as it pertains to all of this, but really thank you for joining us and thank you to everyone for listening.
Dr. Shantanu Singh:
Thanks, Brendan.
Speaker 1:
Thanks for joining us. Stay tuned for the next episode of TD Cowen Insights.
This podcast should not be copied, distributed, published or reproduced, in whole or in part. The information contained in this recording was obtained from publicly available sources, has not been independently verified by TD Securities, may not be current, and TD Securities has no obligation to provide any updates or changes. All price references and market forecasts are as of the date of recording. The views and opinions expressed in this podcast are not necessarily those of TD Securities and may differ from the views and opinions of other departments or divisions of TD Securities and its affiliates. TD Securities is not providing any financial, economic, legal, accounting, or tax advice or recommendations in this podcast. The information contained in this podcast does not constitute investment advice or an offer to buy or sell securities or any other product and should not be relied upon to evaluate any potential transaction. Neither TD Securities nor any of its affiliates makes any representation or warranty, express or implied, as to the accuracy or completeness of the statements or any information contained in this podcast and any liability therefore (including in respect of direct, indirect or consequential loss or damage) is expressly disclaimed.
Brendan Smith
Director, Life Science & Diagnostic Tools and Biotech Analyst, TD Cowen
Brendan Smith joined TD Cowen in 2019 and covers life science & diagnostic tools and biotech. He holds an MA, MPhil, and Ph.D. from Columbia.