A generalist vision–language foundation model for diverse biomedical tasks (2024)

Data availability

All data in this study are publicly available and can be accessed from: IU X-ray and Peir Gross (https://github.com/nlpaueb/bioCaption), MedICaT (https://github.com/allenai/medicat), PathVQA (https://huggingface.co/datasets/flaviagiammarino/path-vqa), SLAKE 1.0 (https://www.med-vqa.com/slake/), DeepLesion (https://nihcc.app.box.com/v/DeepLesion), OIA-DDR (https://github.com/nkicsl/OIA), CheXpert-v1.0-small (https://www.kaggle.com/datasets/willarevalo/chexpert-v10-small), CytoImageNet (https://www.kaggle.com/datasets/stanleyhua/cytoimagenet), ISIC 2020 (https://challenge2020.isic-archive.com), Retinal Fundus (https://www.kaggle.com/c/diabetic-retinopathy-detection), MIMIC-III Clinical Notes (https://paperswithcode.com/dataset/hospital-admission-notes-from-mimic-iii), NCBI BioNLP (https://www.ncbi.nlm.nih.gov/research/bionlp/Data/), PubMed abstracts derived from the BLUE benchmark (https://github.com/ncbi-nlp/BLUE_Benchmark), VQA-RAD (https://osf.io/89kps/), CBIS-DDSM (https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset), SZ-CXR and MC-CXR (access can be requested via the contact at http://archive.nlm.nih.gov/repos/chestImages.php), MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.1.0/), MedNLI (https://physionet.org/content/mednli/1.0.0/), TREC 2022 (https://www.trec-cds.org/2022.html), SEER (https://seer.cancer.gov), MIMIC-III (https://physionet.org/content/mimiciii/1.4/), HealthcareMagic (https://huggingface.co/datasets/UCSD26/medical_dialog), MeQSum (https://huggingface.co/datasets/sumedh/MeQSum), MedMNIST v2 (https://medmnist.com) and ROCO (https://github.com/razorx89/roco-dataset). A randomly sampled subset of the RSNA Pneumonia Detection Challenge (2018) dataset was used for zero-shot prediction (https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018). MedMNIST-Raw was curated from multiple sources, including NCT-CRC-HE-100K (colon pathology) (https://zenodo.org/records/1214456), HAM10000 (dermoscopy) (https://github.com/ptschandl/HAM10000_dataset), OCT and chest X-ray (https://data.mendeley.com/datasets/rscbjbr9sj/3), breast ultrasound (https://scholar.cu.edu.eg/Dataset_BUSI.zip), blood cell microscopy (https://data.mendeley.com/datasets/snkd93bnjr/1) and the Liver Tumor Segmentation Benchmark (LiTS) (https://competitions.codalab.org/competitions/17094). The VQA data for human evaluation are derived from Medical-Diff-VQA (https://physionet.org/content/medical-diff-vqa/1.0.0/), excluding questions related to differences, as these require a two-image input. Report generation and summarization samples for human evaluation are extracted from MIMIC-CXR. The instruction-following data used in this article are derived from PubMed (https://pubmed.ncbi.nlm.nih.gov) following the LLaVA-Med approach (https://github.com/microsoft/LLaVA-Med/blob/main/download_data.sh) and are combined with the training sets of PathVQA and SLAKE. We also provide a table with more details of the major datasets in Extended Data Table 2.
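
For the datasets hosted on Hugging Face, a minimal sketch of programmatic access is shown below, assuming the Hugging Face `datasets` library is installed; the dataset identifier is taken from the URL above, but the field names in the print statement are assumptions rather than a documented schema.

```python
# Minimal sketch: loading the PathVQA training split listed above.
# Assumes the Hugging Face `datasets` library; field names are assumptions.
from datasets import load_dataset

pathvqa = load_dataset("flaviagiammarino/path-vqa", split="train")
sample = pathvqa[0]
print(sample["question"], "->", sample["answer"])  # assumed field names
```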

Code availability

The pretrained and fine-tuned models, as well as source code for training, inference and data preprocessing, can be accessed at https://github.com/taokz/BiomedGPT.

References

  1. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).

  2. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).

  3. Moody, L. et al. The person-centred care guideline: from principle to practice. J. Patient Exp. 5, 282–288 (2018).

  4. Langberg, E. M., Dyhr, L. & Davidsen, A. S. Development of the concept of patient-centredness–a systematic review. Patient Educ. Couns. 102, 1228–1236 (2019).

  5. Bates, D. W. et al. Reducing the frequency of errors in medicine using information technology. J. Am. Med. Inform. Assoc. 8, 299–308 (2001).

  6. Tu, T. et al. Towards generalist biomedical AI. NEJM AI https://doi.org/10.1056/AIoa2300138 (2024).

  7. Reed, S. et al. A generalist agent. Transact. Mach. Learn. Res. https://openreview.net/pdf?id=1ikK0kHjvj (2022).

  8. Driess, D. et al. Palm-e: an embodied multimodal language model. In Proc. 40th International Conference on Machine Learning 8469–8488 (JMLR.org, 2023).

  9. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (Neural Information Processing Systems Foundation, 2017).

  10. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

  11. Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

  12. Li, C. et al. Llava-med: training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems 36 (Neural Information Processing Systems Foundation, 2024).

  13. Wu, C., Zhang, X., Zhang, Y., Wang, Y., & Xie, W. Towards generalist foundation model for radiology. Preprint at https://arxiv.org/abs/2308.02463 (2023).

  14. Luo, R. et al. BioGPT: generative pretrained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).

  15. Zhang, S. et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. Preprint at https://arxiv.org/abs/2303.00915 (2023).

  16. Phan, L. N. et al. Scifive: a text-to-text transformer model for biomedical literature. Preprint at https://arxiv.org/abs/2106.03598 (2021).

  17. Lau, J. et al. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 5, 180251 (2018).

  18. Liu, B. et al. Slake: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In Proc. IEEE International Symposium on Biomedical Imaging (ISBI) 1650–1654 (Institute of Electrical and Electronics Engineers, 2021).

  19. He, X. et al. Towards visual question answering on pathology images. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) 708–718 (Association for Computational Linguistics, 2021).

  20. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 23, 304–310 (2016).

  21. Johnson, A. E. et al. MIMIC-CXR-JPG — chest radiographs with structured labels. PhysioNet 101, 215–220 (2019).

  22. Pavlopoulos, J., Kougia, V., & Androutsopoulos, I. A survey on biomedical image captioning. In Proc. Second Workshop on Shortcomings in Vision and Language 26–36 (Association for Computational Linguistics, 2019).

  23. Li, P. et al. Self-supervised vision-language pretraining for medical visual question answering. In Proc. IEEE 20th International Symposium on Biomedical Imaging (ISBI) 1–5 (Institute of Electrical and Electronics Engineers, 2023).

  24. Zhang, X. et al. Pmc-vqa: visual instruction tuning for medical visual question answering. Preprint at https://arxiv.org/abs/2305.10415 (2023).

  25. Van Sonsbeek, T. et al. Open-ended medical visual question answering through prefix tuning of language models. In International Conference on Medical Image Computing and Computer-Assisted Intervention 726–736 (MICCAI, 2023).

  26. Lin, C. Y. Rouge: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).

  27. Banerjee, S. & Lavie, A. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (eds. Goldstein, J., Lavie, A., Lin, C.-Y. & Voss, C.) 65–72 (Association for Computational Linguistics, 2005).

  28. Vedantam, R., Zitnick, C. L. & Parikh, D. Cider: Consensus-based image description evaluation. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) 4566–4575 (Institute of Electrical and Electronics Engineers, 2015).

  29. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds. Gurevych, I. & Miyao, Y.) 2577–2586 (Association for Computational Linguistics, 2017).

  30. Chen, Z. et al. Generating radiology reports via memory-driven transformer. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 1439–1449 (Association for Computational Linguistics, 2020).

  31. Liu, F. et al. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 13753–13762 (Institute of Electrical and Electronics Engineers/Computer Vision Foundation, 2021).

  32. Yuan, H. et al. Biobart: pretraining and evaluation of a biomedical generative language model. In Proc. 21st Workshop on Biomedical Language Processing (eds. Demner-Fushman, D., Cohen, K. B., Ananiadou, S. & Tsujii, J.) 97–109 (Association for Computational Linguistics, 2022).

  33. Van Veen, D. et al. Radadapt: radiology report summarization via lightweight domain adaptation of large language models. In 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks (eds. Demner-fushman, D., Ananiadou, S. & Cohen, K.) 449–460 (Association for Computational Linguistics, 2023).

  34. Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4, 9 (2023).

  35. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  36. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. Proc. 56th Annual Meeting of the Association for Computational Linguistics 1 (eds. Gurevych, I. & Miyao, Y.) 2577–2586 (2018).

  37. Yang, J. et al. MedMNIST v2 - a large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci. Data 10, 41 (2023).

  38. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 4, 475–477 (2014).

  39. Capellán-Martín, D. et al. A lightweight, rapid and efficient deep convolutional network for chest x-ray tuberculosis detection. In Proc. 2023 IEEE 20th Int. Symp. Biomed. Imaging (ISBI) 1–5 (IEEE, 2023).

  40. Manzari, O. N. et al. Medvit: a robust vision transformer for generalized medical image classification. Comput. Biol. Med. 157, 106791 (2023).

  41. Lee, R. S. et al. A curated mammography data set for use in computer-aided detection and diagnosis research. Sci. Data 4, 1–9 (2017).

  42. Romanov, A. & Shivade, C. Lessons from natural language inference in the clinical domain. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 1586–1596 (Association for Computational Linguistics, 2018).

  43. Gloeckler Ries, L. A. et al. Cancer survival and incidence from the surveillance, epidemiology, and end results (SEER) program. Oncologist 8, 541–552 (2003).

  44. Abacha, A. B., & Demner-Fushman, D. On the summarization of consumer health questions. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2228–2234 (2019).

  45. Zeng, G. et al. Meddialog: large-scale medical dialogue datasets. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 9241–9250 (Association for Computational Linguistics, 2020).

  46. Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).

  47. Dubey, S. et al. Using machine learning for healthcare treatment planning. Front. Artif. Intell. 6, 1124182 (2023).

  48. Roberts, K. et al. Overview of the TREC 2021 clinical trials track. In Proc. Thirtieth Text Retrieval Conference (TREC, 2021).

  49. Van Aken, B. et al. Clinical outcome prediction from admission notes using self-supervised knowledge integration. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 881–893 (Association for Computational Linguistics, 2021).

  50. OpenAI. GPT-4V(ision) system card. OpenAI https://openai.com/research/gpt-4v-system-card (2023).

  51. Wang, P. et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. Proc. Int. Conf. Mach. Learn. PMLR 162, 23318–23340 (2022).

  52. Hu, X. et al. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In Proc. 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 4156–4165 (Association for Computing Machinery, 2023).

  53. Jeong, J. et al. Multimodal image-text matching improves retrieval-based chest x-ray report generation. In Proc. Medical Imaging with Deep Learning 227 978–990 (Proceedings of Machine Learning Research, 2024).

  54. Fu, S. et al. Assessment of data quality variability across two EHR systems through a case study of post-surgical complications. In Proc. AMIA Joint Summits on Translational Science 196–205 (American Medical Informatics Association, 2022).

  55. Delbrouck, J. B. et al. Improving the factual correctness of radiology report generation with semantic rewards. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds. Goldberg, Y., Kozareva, Z. & Zhang, Y.) 4348–4360 (Association for Computational Linguistics, 2022).

  56. Yang, H., Lin, J., Yang, A., Wang, P. & Zhou, C. Prompt tuning for unified multimodal pretrained models. In Findings of the Association for Computational Linguistics: ACL 2023 (eds. Rogers, A., Boyd-Graber, J. & Okazaki, N.) 402–416 (Association for Computational Linguistics, 2023).

  57. Chen, Z. et al. Towards understanding the mixture-of-experts layer in deep learning. Adv. Neural Inf. Process. Syst. 35, 23049–23062 (2022).

  58. Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. In International Conference on Learning Representations (International Conference on Learning Representations, 2021).

  59. Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pretraining of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Burstein, J., Doran, C. & Solorio, T.) 4171–4186 (Association for Computational Linguistics, 2019).

  60. Ke, G., He, D. & Liu, T. Y. Rethinking positional encoding in language pretraining. In International Conference on Learning Representations (ICLR, 2019).

  61. Ba, J. L., Kiros, J. R. & Hinton, G.E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016)

  62. Shleifer, S., Weston, J. & Ott, M. NormFormer: improved transformer pretraining with extra normalization. Preprint at https://arxiv.org/abs/2110.09456 (2021).

  63. Dai, Z., Liu, H., Le, Q. V. & Tan, M. Coatnet: marrying convolution and attention for all data sizes. In Proc. Advances in Neural Information Processing Systems 34 (NeurIPS 2021) 3965–3977 (Neural Information Processing Systems, 2021).

  64. Wang, Z. et al. SimVLM: simple visual language model pretraining with weak supervision. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  65. Esser, P., Rombach, R. & Ommer, B. Taming transformers for high-resolution image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 12873–12883 (Institute of Electrical and Electronics Engineers/Computer Vision Foundation, 2021).

  66. Chen, T. et al. Pix2seq: a language modeling framework for object detection. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  67. Gage, P. A new algorithm for data compression. C. Users J. 12, 23–38 (1994).

  68. He, K. et al. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (Institute of Electrical and Electronics Engineers, 2016).

  69. Wei, J. et al. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  70. Schick, T. & Schütze, H. It’s not just size that matters: small language models are also few-shot learners. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Toutanova, K. et al.) 2339–2352 (Association for Computational Linguistics, 2021).

  71. Bao, H. et al. BEiT: BERT pretraining of image transformers. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  72. Xu, H. et al. E2E-VLP: end-to-end vision-language pretraining enhanced by visual learning. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (eds. Zong, C. et al.) 503–513 (2021).

  73. Sutskever, I., Vinyals, O. & Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (Conference on Neural Information Processing Systems, 2014).

  74. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (International Conference on Learning Representations, 2019).

  75. Micikevicius, P. et al. Mixed precision training. In International Conference on Learning Representations (International Conference on Learning Representations, 2018).

  76. Raghu, M. et al. Transfusion: understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems 32 (Conference on Neural Information Processing Systems, 2019).

  77. Zhou, C. et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. Preprint at https://arxiv.org/abs/2302.09419 (2023).

Acknowledgements

NSF grant CRII-2246067, NSF POSE: Phase II-2346158 and Lehigh Grant FRGS00011497 supported L.S., K.Z., Z.Y. and Y.L. NIH grant R21EY034179, NSF grants NCS-2319451, MRI-2215789 and IIS-1909879, as well as Lehigh’s Accelerator and CORE grants S00010293 and S001250, supported L.H. and R.Z. NIH grants R01HL159183 and RF1AG057892 supported Q.L. NIH grant R03AG078625 supported X.L. NIH grants R01EB19403 and R01LM11934 supported S.F. and H.L. Icons used in Fig. 2 were made by Freepik, surang, Smartline and Blackonion02 at www.flaticon.com.

Author information

Authors and Affiliations

  1. Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA

    Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Brian D. Davison, Lifang He & Lichao Sun

  2. School of Computing, University of Georgia, Athens, GA, USA

    Zhengliang Liu & Tianming Liu

  3. Samsung Research America, Mountain View, CA, USA

    Xun Chen

  4. Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA

    Hui Ren, Xiang Li & Quanzheng Li

  5. Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA

    Jing Huang & Yong Chen

  6. PolicyLab, Children’s Hospital of Philadelphia, Philadelphia, PA, USA

    Jing Huang

  7. Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA

    Chen Chen

  8. Department of Computer Science and Engineering, University of California, Santa Cruz, CA, USA

    Yuyin Zhou

  9. McWilliams School of Biomedical Informatics, UTHealth, Houston, TX, USA

    Sunyang Fu & Hongfang Liu

  10. Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, USA

    Wei Liu

  11. The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania, Philadelphia, PA, USA

    Yong Chen

  12. Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA

    Yong Chen

  13. Leonard Davis Institute of Health Economics, Philadelphia, PA, USA

    Yong Chen

  14. Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA

    James Zou

  15. Department of Computer Science, Stanford University, Stanford, CA, USA

    James Zou

Authors

  1. Kai Zhang
  2. Rong Zhou
  3. Eashan Adhikarla
  4. Zhiling Yan
  5. Yixin Liu
  6. Jun Yu
  7. Zhengliang Liu
  8. Xun Chen
  9. Brian D. Davison
  10. Hui Ren
  11. Jing Huang
  12. Chen Chen
  13. Yuyin Zhou
  14. Sunyang Fu
  15. Wei Liu
  16. Tianming Liu
  17. Xiang Li
  18. Yong Chen
  19. Lifang He
  20. James Zou
  21. Quanzheng Li
  22. Hongfang Liu
  23. Lichao Sun

Contributions

K.Z. and L.S. designed the study. K.Z., R.Z. and E.A. carried out data collection, data preprocessing, model construction and model validation. J.Y., Z.Y., Y.L. and Z.L. carried out the data analysis and benchmarking of the results. X.C., B.D.D., J.H., C.C., Y.Z., S.F., W.L., T.L., X.L., Y.C., L.H., J.Z., Q.L. and H.L. provided knowledge support and interpreted the findings. H.R. carried out the human evaluation of the text generated by BiomedGPT and GPT-4V. L.S. provided knowledge support, interpreted the findings and supervised the study. All authors contributed to manuscript writing and reviewed and approved the final version. L.H., X.L. and L.S. co-supervised the study.

Corresponding authors

Correspondence to Xiang Li, Lifang He or Lichao Sun.

Ethics declarations

Competing interests

The research was conducted independently of any commercial or financial relationships that could be construed as a potential conflict of interest. Although X.C. is employed by Samsung, the company was not involved in any aspect of this research. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Statistics of pretraining and fine-tuning datasets.

(a) Modality distribution of pretraining data used in BiomedGPT. (b) Training, validation and test splits of the datasets used in downstream fine-tuning, reported in the format of number of training samples/number of validation samples/number of test samples for each dataset. More details of the data split are described in Supplementary Table 7.

Extended Data Fig. 2 Overview of BiomedGPT’s model configuration and architecture.

(a) Detailed model configuration of BiomedGPT. Here, ‘#’ indicates number of. ‘Att.’, ‘Enc.’ and ‘Dec.’ indicate Attention, Encoder and Decoder, respectively. The hidden size is the size of the embeddings and the size of the output of each self-attention and feed-forward layer. The first layer of FFN expands the hidden size to the intermediate size, and the second layer contracts it back to the hidden size. This expansion and contraction allow the network to create more complex representations. During the pretraining phase, image processing involves resizing and cropping the images to varying resolutions, corresponding to the input sizes listed in the table. It should be noted that during fine-tuning and inference stages, the input resolution of BiomedGPT can be flexibly adjusted according to the specific requirements of the task. (b) The neural network architecture of BiomedGPT, which includes bidirectional encoder blocks and autoregressive decoder blocks. The number of blocks varies for different model scales.
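
As an illustration of the expansion–contraction pattern described above, the following is a minimal sketch of a transformer feed-forward block; the 768/3,072 sizes and the GELU activation are illustrative assumptions, not the published BiomedGPT configuration.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Feed-forward block: hidden_size -> intermediate_size -> hidden_size."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.expand = nn.Linear(hidden_size, intermediate_size)    # expansion
        self.contract = nn.Linear(intermediate_size, hidden_size)  # contraction
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.contract(self.act(self.expand(x)))

# Illustrative sizes only (not the published configuration).
ffn = FeedForward(hidden_size=768, intermediate_size=3072)
out = ffn(torch.randn(2, 16, 768))  # (batch, sequence length, hidden size)
```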

Extended Data Fig. 3 The graphical illustrations of the key components in BiomedGPT.

(a) Head-scale multi-head attention module in BiomedGPT. The trainable parameter γ_h is applied prior to the output projection for each head. (b) Instead of adding the absolute positional embedding P_i to the input embedding I_i (left), we compute the positional correlation and input correlation separately with different projection matrices and add them together in the self-attention module (right). (c) Graphical illustration of relative position bias. Such an inductive bias B_{j-i} is a learnable parameter that can be viewed as the embedding of the relative position j − i, which is injected into the query–key product, \(\frac{1}{\sqrt{d}}({I}_{i}{W}^{\,Q})({P}_{i}{W}^{\,K})+{B}_{j-i}\), and shared across all layers. (d) An example of trie-based beam search: along the path across ‘Lipid’ and ‘breakdown’, BiomedGPT sets the logits of all invalid tokens (‘mechanism’ and ‘pathway’) to −∞ when computing log-probabilities for the target token ‘in’. It is worth noting that trie-based search is also applied during the validation phase of the fine-tuning stage for acceleration (approximately 16× speed-up in our experiments).
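
A minimal sketch of the trie-constrained logit masking in panel (d) is shown below; the toy trie, vocabulary and tensor shapes are illustrative assumptions, not the model’s actual tokenizer or decoding code.

```python
import math
import torch

# Toy candidate-answer trie: only paths in the trie are valid continuations.
trie = {"Lipid": {"breakdown": {"in": {}}, "metabolism": {}}}

def allowed_next(trie: dict, prefix: list) -> set:
    """Return the set of tokens that validly extend the decoded prefix."""
    node = trie
    for token in prefix:
        node = node[token]
    return set(node.keys())

def mask_logits(logits: torch.Tensor, vocab: list, allowed: set) -> torch.Tensor:
    """Set logits of invalid tokens to -inf before log-probabilities are computed."""
    masked = logits.clone()
    for i, token in enumerate(vocab):
        if token not in allowed:
            masked[i] = -math.inf
    return masked

vocab = ["in", "mechanism", "pathway", "metabolism"]
logits = torch.randn(len(vocab))
masked = mask_logits(logits, vocab, allowed_next(trie, ["Lipid", "breakdown"]))
print(torch.log_softmax(masked, dim=-1))  # only 'in' keeps a finite log-probability
```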

Extended Data Fig. 4 Comparative performance of BiomedGPT and Med-PaLM M, and prompt-tuning results in image classification.

(a) Comparison between BiomedGPT-B and Med-PaLM M on the CBIS-DDSM dataset. (b) Experimental results of prompt tuning BiomedGPT-B on three image classification datasets. Prompt tuning learns ‘soft prompts’, that is, extra model parameters for each task, instead of making a task-specific copy of the entire pretrained model for each downstream task, and inference must be performed in separate batches. We must mention that the addition of soft prompts is contrary to the design principle of a generalist model. We injected two prompt layers into the encoder and decoder and varied the prompt length over {20, 40, 60, 80, 100, 120} to compare performance against full-model fine-tuning. The preliminary results for ‘Colon pathology’, ‘Blood cell microscope’ and ‘Chest X-ray’ were obtained after 100, 512 and 55 training epochs, respectively, all with a consistent batch size of 512. We observed that as the prompt length increases, model performance tends to improve. However, despite an increased number of tuning epochs compared with fine-tuning the original BiomedGPT (Fig. 3c), performance after prompt tuning notably lags behind that of full-model fine-tuning. Specifically, considering only the best results in prompt tuning, there are substantial accuracy reductions of 32.3%, 54.6% and 32.6% on these three datasets, respectively.
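
A minimal sketch of the soft-prompt idea is shown below: learnable prompt vectors are prepended to the input embeddings while the backbone would remain frozen; the class name, sizes and initialization are illustrative assumptions, not the implementation used in these experiments.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the input embeddings."""

    def __init__(self, prompt_length: int, hidden_size: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = input_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompts, input_embeds], dim=1)

# Prompt lengths swept in the experiment: 20, 40, 60, 80, 100, 120.
soft_prompt = SoftPrompt(prompt_length=60, hidden_size=768)
augmented = soft_prompt(torch.randn(4, 32, 768))  # -> shape (4, 60 + 32, 768)
```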

Extended Data Fig. 5 Additional zero-shot results of BiomedGPT.

(a) Graphical illustration of zero-shot classification using CLIP-style models, linear-probing transfer learning using ViT or BERT-style models, and zero-shot generation with BiomedGPT. Notably, our model can generate the response without additional components such as the label candidates required by CLIP or the trained linear classifier required by ViT. (b) Zero-shot performance on five disease diagnosis tasks. (c) BiomedGPT shows competitive zero-shot performance compared with Med-PaLM M at a much smaller model scale. The SOTA fine-tuned model for TB detection is TBLightNet. Note that no single model consistently outperforms the others across all four metrics used in report generation; here, SOTAs represent the best performance achieved in each specific metric. We fine-tuned our pretrained BiomedGPT-B on MultiMedBench, which Med-PaLM M proposed and used for fine-tuning based on the pretrained PaLM-E. We also attempted to fine-tune LLaVA-Med; however, the time and computational costs were prohibitive owing to the large scale of the model and data. Therefore, we reported the results using the pretrained checkpoint of LLaVA-Med.
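
The contrast in panel (a) can be sketched as follows: CLIP-style zero-shot classification scores an image embedding against embeddings of supplied label candidates, whereas BiomedGPT decodes the answer text directly. The random embeddings and the commented `generate` call below are placeholders, not real model outputs or APIs.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for encoder outputs (not real model outputs).
image_emb = F.normalize(torch.randn(512), dim=0)
label_candidates = ["pneumonia", "no finding"]          # must be supplied for CLIP-style models
text_embs = F.normalize(torch.randn(len(label_candidates), 512), dim=1)

# CLIP-style zero-shot classification: argmax similarity over the candidate labels.
clip_prediction = label_candidates[int((text_embs @ image_emb).argmax())]
print("CLIP-style prediction:", clip_prediction)

# BiomedGPT-style zero-shot generation: the answer is decoded directly from the prompt,
# with no candidate list or separately trained linear classifier, e.g.
# answer = model.generate(image, "what disease does the image show?")  # hypothetical call
```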

Supplementary information

Supplementary Information

Supplementary Figs. 1–9 and Supplementary Tables 1–7.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhang, K., Zhou, R., Adhikarla, E. et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat Med (2024). https://doi.org/10.1038/s41591-024-03185-2
