A generalist vision–language foundation model for diverse biomedical tasks (2024)

Data availability

All data in this study are publicly available and can be accessed from: IU X-ray and Peir Gross (https://github.com/nlpaueb/bioCaption), MedICaT (https://github.com/allenai/medicat), PathVQA (https://huggingface.co/datasets/flaviagiammarino/path-vqa), SLAKE 1.0 (https://www.med-vqa.com/slake/), DeepLesion (https://nihcc.app.box.com/v/DeepLesion), OIA-DDR (https://github.com/nkicsl/OIA), CheXpert-v1.0-small (https://www.kaggle.com/datasets/willarevalo/chexpert-v10-small), CytoImageNet (https://www.kaggle.com/datasets/stanleyhua/cytoimagenet), ISIC 2020 (https://challenge2020.isic-archive.com), Retinal Fundus (https://www.kaggle.com/c/diabetic-retinopathy-detection), MIMIC-III Clinical Notes (https://paperswithcode.com/dataset/hospital-admission-notes-from-mimic-iii), NCBI BioNLP (https://www.ncbi.nlm.nih.gov/research/bionlp/Data/), PubMed abstracts derived from the BLUE benchmark (https://github.com/ncbi-nlp/BLUE_Benchmark), VQA-RAD (https://osf.io/89kps/), CBIS-DDSM (https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset), SZ-CXR and MC-CXR (access can be requested via the contact at http://archive.nlm.nih.gov/repos/chestImages.php), MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.1.0/), MedNLI (https://physionet.org/content/mednli/1.0.0/), TREC 2022 (https://www.trec-cds.org/2022.html), SEER (https://seer.cancer.gov), MIMIC-III (https://physionet.org/content/mimiciii/1.4/), HealthcareMagic (https://huggingface.co/datasets/UCSD26/medical_dialog), MeQSum (https://huggingface.co/datasets/sumedh/MeQSum), MedMNIST v2 (https://medmnist.com) and ROCO (https://github.com/razorx89/roco-dataset). A randomly sampled subset of the RSNA Pneumonia Detection Challenge (2018) dataset was used for zero-shot prediction (https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018). MedMNIST-Raw was curated from multiple sources, including NCT-CRC-HE-100K (colon pathology) (https://zenodo.org/records/1214456), HAM10000 (dermoscopy) (https://github.com/ptschandl/HAM10000_dataset), OCT and chest X-ray (https://data.mendeley.com/datasets/rscbjbr9sj/3), breast ultrasound (https://scholar.cu.edu.eg/Dataset_BUSI.zip), blood cell microscopy (https://data.mendeley.com/datasets/snkd93bnjr/1) and the Liver Tumor Segmentation Benchmark (LiTS) (https://competitions.codalab.org/competitions/17094). The VQA data for human evaluation are derived from Medical-Diff-VQA (https://physionet.org/content/medical-diff-vqa/1.0.0/), excluding questions related to differences, as these require a two-image input. Report generation and summarization samples for human evaluation are extracted from MIMIC-CXR. The instruction-following data used in this article are derived from PubMed (https://pubmed.ncbi.nlm.nih.gov) following the LLaVA-Med approach (https://github.com/microsoft/LLaVA-Med/blob/main/download_data.sh) and are combined with the training sets of PathVQA and SLAKE. We also provide a table with more details of the major datasets in Extended Data Table 2.
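
For the datasets hosted on Hugging Face, a minimal sketch of programmatic access is shown below, assuming the Hugging Face `datasets` library is installed; the dataset identifier is taken from the URL above, but the field names in the print statement are assumptions rather than a documented schema.

```python
# Minimal sketch: loading the PathVQA training split listed above.
# Assumes the Hugging Face `datasets` library; field names are assumptions.
from datasets import load_dataset

pathvqa = load_dataset("flaviagiammarino/path-vqa", split="train")
sample = pathvqa[0]
print(sample["question"], "->", sample["answer"])  # assumed field names
```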

Code availability

The pretrained and fine-tuned models, as well as source code for training, inference and data preprocessing, can be accessed at https://github.com/taokz/BiomedGPT.

References

  1. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).

  2. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).

  3. Moody, L. et al. The person-centred care guideline: from principle to practice. J. Patient Exp. 5, 282–288 (2018).

  4. Langberg, E. M., Dyhr, L. & Davidsen, A. S. Development of the concept of patient-centredness–a systematic review. Patient Educ. Couns. 102, 1228–1236 (2019).

  5. Bates, D. W. et al. Reducing the frequency of errors in medicine using information technology. J. Am. Med. Inform. Assoc. 8, 299–308 (2001).

  6. Tu, T. et al. Towards generalist biomedical AI. NEJM AI https://doi.org/10.1056/AIoa2300138 (2024).

  7. Reed, S. et al. A generalist agent. Transact. Mach. Learn. Res. https://openreview.net/pdf?id=1ikK0kHjvj (2022).

  8. Driess, D. et al. Palm-e: an embodied multimodal language model. In Proc. 40th International Conference on Machine Learning 8469–8488 (JMLR.org, 2023).

  9. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (Neural Information Processing Systems Foundation, 2017).

  10. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

  11. Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

  12. Li, C. et al. Llava-med: training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems 36 (Neural Information Processing Systems Foundation, 2024).

  13. Wu, C., Zhang, X., Zhang, Y., Wang, Y., & Xie, W. Towards generalist foundation model for radiology. Preprint at https://arxiv.org/abs/2308.02463 (2023).

  14. Luo, R. et al. BioGPT: generative pretrained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).

  15. Zhang, S. et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. Preprint at https://arxiv.org/abs/2303.00915 (2023).

  16. Phan, L. N. et al. Scifive: a text-to-text transformer model for biomedical literature. Preprint at https://arxiv.org/abs/2106.03598 (2021).

  17. Lau, J. et al. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 5, 180251 (2018).

  18. Liu, B. et al. Slake: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In Proc. IEEE International Symposium on Biomedical Imaging (ISBI) 1650–1654 (Institute of Electrical and Electronics Engineers, 2021).

  19. He, X. et al. Towards visual question answering on pathology images. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) 708–718 (Association for Computational Linguistics, 2021).

  20. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 23, 304–310 (2016).

  21. Johnson, A. E. et al. MIMIC-CXR-JPG — chest radiographs with structured labels. PhysioNet 101, 215–220 (2019).

  22. Pavlopoulos, J., Kougia, V., & Androutsopoulos, I. A survey on biomedical image captioning. In Proc. Second Workshop on Shortcomings in Vision and Language 26–36 (Association for Computational Linguistics, 2019).

  23. Li, P. et al. Self-supervised vision-language pretraining for medical visual question answering. In Proc. IEEE 20th International Symposium on Biomedical Imaging (ISBI) 1–5 (Institute of Electrical and Electronics Engineers, 2023).

  24. Zhang, X. et al. Pmc-vqa: visual instruction tuning for medical visual question answering. Preprint at https://arxiv.org/abs/2305.10415 (2023).

  25. Van Sonsbeek, T. et al. Open-ended medical visual question answering through prefix tuning of language models. In International Conference on Medical Image Computing and Computer-Assisted Intervention 726–736 (MICCAI, 2023).

  26. Lin, C. Y. Rouge: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).

  27. Banerjee, S. & Lavie, A. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (eds. Goldstein, J., Lavie, A., Lin, C.-Y. & Voss, C.) 65–72 (Association for Computational Linguistics, 2005).

  28. Vedantam, R., Zitnick, C. L. & Parikh, D. Cider: Consensus-based image description evaluation. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) 4566–4575 (Institute of Electrical and Electronics Engineers, 2015).

  29. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds. Gurevych, I. & Miyao, Y.) 2577–2586 (Association for Computational Linguistics, 2017).

  30. Chen, Z. et al. Generating radiology reports via memory-driven transformer. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 1439–1449 (Association for Computational Linguistics, 2020).

  31. Liu, F. et al. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 13753–13762 (Institute of Electrical and Electronics Engineers/Computer Vision Foundation, 2021).

  32. Yuan, H. et al. Biobart: pretraining and evaluation of a biomedical generative language model. In Proc. 21st Workshop on Biomedical Language Processing (eds. Demner-Fushman, D., Cohen, K. B., Ananiadou, S. & Tsujii, J.) 97–109 (Association for Computational Linguistics, 2022).

  33. Van Veen, D. et al. Radadapt: radiology report summarization via lightweight domain adaptation of large language models. In 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks (eds. Demner-fushman, D., Ananiadou, S. & Cohen, K.) 449–460 (Association for Computational Linguistics, 2023).

  34. Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4, 9 (2023).

  35. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  36. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. Proc. 56th Annual Meeting of the Association for Computational Linguistics 1 (eds. Gurevych, I. & Miyao, Y.) 2577–2586 (2018).

  37. Yang, J. et al. MedMNIST v2 - a large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci. Data 10, 41 (2023).

  38. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 4, 475–477 (2014).

  39. Capellán-Martín, D. et al. A lightweight, rapid and efficient deep convolutional network for chest x-ray tuberculosis detection. In Proc. 2023 IEEE 20th Int. Symp. Biomed. Imaging (ISBI) 1–5 (IEEE, 2023).

  40. Manzari, O. N. et al. Medvit: a robust vision transformer for generalized medical image classification. Comput. Biol. Med. 157, 106791 (2023).

  41. Lee, R. S. et al. A curated mammography data set for use in computer-aided detection and diagnosis research. Sci. Data 4, 1–9 (2017).

  42. Romanov, A. & Shivade, C. Lessons from natural language inference in the clinical domain. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 1586–1596 (Association for Computational Linguistics, 2018).

  43. Gloeckler Ries, L. A. et al. Cancer survival and incidence from the surveillance, epidemiology, and end results (SEER) program. Oncologist 8, 541–552 (2003).

  44. Abacha, A. B., & Demner-Fushman, D. On the summarization of consumer health questions. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2228–2234 (2019).

  45. Zeng, G. et al. Meddialog: large-scale medical dialogue datasets. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 9241–9250 (Association for Computational Linguistics, 2020).

  46. Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).

  47. Dubey, S. et al. Using machine learning for healthcare treatment planning. Front. Artif. Intell. 6, 1124182 (2023).

  48. Roberts, K. et al. Overview of the TREC 2021 clinical trials track. In Proc. Thirtieth Text Retrieval Conference (TREC, 2021).

  49. Van Aken, B. et al. Clinical outcome prediction from admission notes using self-supervised knowledge integration. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 881–893 (Association for Computational Linguistics, 2021).

  50. OpenAI. GPT-4V(ision) system card. OpenAI https://openai.com/research/gpt-4v-system-card (2023).

  51. Wang, P. et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. Proc. Int. Conf. Mach. Learn. PMLR 162, 23318–23340 (2022).

  52. Hu, X. et al. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In Proc. 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 4156–4165 (Association for Computing Machinery, 2023).

  53. Jeong, J. et al. Multimodal image-text matching improves retrieval-based chest x-ray report generation. In Proc. Medical Imaging with Deep Learning 227 978–990 (Proceedings of Machine Learning Research, 2024).

  54. Fu, S. et al. Assessment of data quality variability across two EHR systems through a case study of post-surgical complications. In Proc. AMIA Joint Summits on Translational Science 196–205 (American Medical Informatics Association, 2022).

  55. Delbrouck, J. B. et al. Improving the factual correctness of radiology report generation with semantic rewards. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds. Goldberg, Y., Kozareva, Z. & Zhang, Y.) 4348–4360 (Association for Computational Linguistics, 2022).

  56. Yang, H., Lin, J., Yang, A., Wang, P. & Zhou, C. Prompt tuning for unified multimodal pretrained models. In Findings of the Association for Computational Linguistics: ACL 2023 (eds. Rogers, A., Boyd-Graber, J. & Okazaki, N.) 402–416 (Association for Computational Linguistics, 2023).

  57. Chen, Z. et al. Towards understanding the mixture-of-experts layer in deep learning. Adv. Neural Inf. Process. Syst. 35, 23049–23062 (2022).

  58. Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. In International Conference on Learning Representations (International Conference on Learning Representations, 2021).

  59. Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pretraining of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Burstein, J., Doran, C. & Solorio, T.) 4171–4186 (Association for Computational Linguistics, 2019).

  60. Ke, G., He, D. & Liu, T. Y. Rethinking positional encoding in language pretraining. In International Conference on Learning Representations (ICLR, 2019).

  61. Ba, J. L., Kiros, J. R. & Hinton, G.E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016)

  62. Shleifer, S., Weston, J. & Ott, M. NormFormer: improved transformer pretraining with extra normalization. Preprint at https://arxiv.org/abs/2110.09456 (2021).

  63. Dai, Z., Liu, H., Le, Q. V. & Tan, M. Coatnet: marrying convolution and attention for all data sizes. In Proc. Advances in Neural Information Processing Systems 34 (NeurIPS 2021) 3965–3977 (Neural Information Processing Systems, 2021).

  64. Wang, Z. et al. SimVLM: simple visual language model pretraining with weak supervision. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  65. Esser, P., Rombach, R. & Ommer, B. Taming transformers for high-resolution image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 12873–12883 (Institute of Electrical and Electronics Engineers/Computer Vision Foundation, 2021).

  66. Chen, T. et al. Pix2seq: a language modeling framework for object detection. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  67. Gage, P. A new algorithm for data compression. C. Users J. 12, 23–38 (1994).

  68. He, K. et al. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (Institute of Electrical and Electronics Engineers, 2016).

  69. Wei, J. et al. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  70. Schick, T. & Schütze, H. It’s not just size that matters: small language models are also few-shot learners. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Toutanova, K. et al.) 2339–2352 (Association for Computational Linguistics, 2021).

  71. Bao, H. et al. BEiT: BERT pretraining of image transformers. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).

  72. Xu, H. et al. E2E-VLP: end-to-end vision-language pretraining enhanced by visual learning. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (eds. Zong, C. et al.) 503–513 (2021).

  73. Sutskever, I., Vinyals, O. & Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (Conference on Neural Information Processing Systems, 2014).

  74. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (International Conference on Learning Representations, 2019).

  75. Micikevicius, P. et al. Mixed precision training. In International Conference on Learning Representations (International Conference on Learning Representations, 2018).

  76. Raghu, M. et al. Transfusion: understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems 32 (Conference on Neural Information Processing Systems, 2019).

  77. Zhou, C. et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. Preprint at https://arxiv.org/abs/2302.09419 (2023).

Acknowledgements

NSF grant CRII-2246067, NSF POSE: Phase II-2346158 and Lehigh Grant FRGS00011497 supported L.S., K.Z., Z.Y. and Y.L. NIH grant R21EY034179, NSF grants NCS-2319451, MRI-2215789 and IIS-1909879, as well as Lehigh’s Accelerator and CORE grants S00010293 and S001250, supported L.H. and R.Z. NIH grants R01HL159183 and RF1AG057892 supported Q.L. NIH grant R03AG078625 supported X.L. NIH grants R01EB19403 and R01LM11934 supported S.F. and H.L. Icons used in Fig. 2 were made by Freepik, surang, Smartline and Blackonion02 at www.flaticon.com.

Author information

Authors and Affiliations

  1. Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA

    Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Brian D. Davison, Lifang He & Lichao Sun

  2. School of Computing, University of Georgia, Athens, GA, USA

    Zhengliang Liu & Tianming Liu

  3. Samsung Research America, Mountain View, CA, USA

    Xun Chen

  4. Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA

    Hui Ren, Xiang Li & Quanzheng Li

  5. Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA

    Jing Huang & Yong Chen

  6. PolicyLab, Children’s Hospital of Philadelphia, Philadelphia, PA, USA

    Jing Huang

  7. Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA

    Chen Chen

  8. Department of Computer Science and Engineering, University of California, Santa Cruz, CA, USA

    Yuyin Zhou

  9. McWilliams School of Biomedical Informatics, UTHealth, Houston, TX, USA

    Sunyang Fu & Hongfang Liu

  10. Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, USA

    Wei Liu

  11. The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania, Philadelphia, PA, USA

    Yong Chen

  12. Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA

    Yong Chen

  13. Leonard Davis Institute of Health Economics, Philadelphia, PA, USA

    Yong Chen

  14. Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA

    James Zou

  15. Department of Computer Science, Stanford University, Stanford, CA, USA

    James Zou

Authors

  1. Kai Zhang
  2. Rong Zhou
  3. Eashan Adhikarla
  4. Zhiling Yan
  5. Yixin Liu
  6. Jun Yu
  7. Zhengliang Liu
  8. Xun Chen
  9. Brian D. Davison
  10. Hui Ren
  11. Jing Huang
  12. Chen Chen
  13. Yuyin Zhou
  14. Sunyang Fu
  15. Wei Liu
  16. Tianming Liu
  17. Xiang Li
  18. Yong Chen
  19. Lifang He
  20. James Zou
  21. Quanzheng Li
  22. Hongfang Liu
  23. Lichao Sun

Contributions

K.Z. and L.S. designed the study. K.Z., R.Z. and E.A. carried out data collection, data preprocessing, model construction and model validation. J.Y., Z.Y., Y.L. and Z.L. carried out the data analysis and benchmarking of the results. X.C., B.D.D., J.H., C.C., Y.Z., S.F., W.L., T.L., X.L., Y.C., L.H., J.Z., Q.L. and H.L. provided knowledge support and interpreted the findings. H.R. carried out the human evaluation of the text generated by BiomedGPT and GPT-4V. L.S. provided knowledge support, interpreted the findings and supervised the study. All authors contributed to manuscript writing and reviewed and approved the final version. L.H., X.L. and L.S. co-supervised the study.

Corresponding authors

Correspondence to Xiang Li, Lifang He or Lichao Sun.

Ethics declarations

Competing interests

The research was conducted independently of any commercial or financial relationships that could be construed as a potential conflict of interest. Although X.C. is employed by Samsung, the company was not involved in any aspect of this research. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Statistics of pretraining and fine-tuning datasets.

(a) Modality distribution of pretraining data used in BiomedGPT. (b) Training, validation and test splits of the datasets used in downstream fine-tuning, reported in the format of number of training samples/number of validation samples/number of test samples for each dataset. More details of the data split are described in Supplementary Table 7.

Extended Data Fig. 2 Overview of BiomedGPT’s model configuration and architecture.

(a) Detailed model configuration of BiomedGPT. Here, ‘#’ indicates number of. ‘Att.’, ‘Enc.’ and ‘Dec.’ indicate Attention, Encoder and Decoder, respectively. The hidden size is the size of the embeddings and the size of the output of each self-attention and feed-forward layer. The first layer of FFN expands the hidden size to the intermediate size, and the second layer contracts it back to the hidden size. This expansion and contraction allow the network to create more complex representations. During the pretraining phase, image processing involves resizing and cropping the images to varying resolutions, corresponding to the input sizes listed in the table. It should be noted that during fine-tuning and inference stages, the input resolution of BiomedGPT can be flexibly adjusted according to the specific requirements of the task. (b) The neural network architecture of BiomedGPT, which includes bidirectional encoder blocks and autoregressive decoder blocks. The number of blocks varies for different model scales.
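
As an illustration of the expansion–contraction pattern described above, the following is a minimal sketch of a transformer feed-forward block; the 768/3,072 sizes and the GELU activation are illustrative assumptions, not the published BiomedGPT configuration.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Feed-forward block: hidden_size -> intermediate_size -> hidden_size."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.expand = nn.Linear(hidden_size, intermediate_size)    # expansion
        self.contract = nn.Linear(intermediate_size, hidden_size)  # contraction
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.contract(self.act(self.expand(x)))

# Illustrative sizes only (not the published configuration).
ffn = FeedForward(hidden_size=768, intermediate_size=3072)
out = ffn(torch.randn(2, 16, 768))  # (batch, sequence length, hidden size)
```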

Extended Data Fig. 3 The graphical illustrations of the key components in BiomedGPT.

(a) Head-scale multi-head attention module in BiomedGPT. The trainable parameter γ_h is applied prior to the output projection for each head. (b) Instead of adding the absolute positional embedding P_i to the input embedding I_i (left), we compute the positional correlation and input correlation separately with different projection matrices and add them together in the self-attention module (right). (c) Graphical illustration of relative position bias. Such an inductive bias B_{j-i} is a learnable parameter that can be viewed as the embedding of the relative position j − i, which is injected into the query–key product, \(\frac{1}{\sqrt{d}}({I}_{i}{W}^{\,Q})({P}_{i}{W}^{\,K})+{B}_{j-i}\), and shared across all layers. (d) An example of trie-based beam search: along the path across ‘Lipid’ and ‘breakdown’, BiomedGPT sets the logits of all invalid tokens (‘mechanism’ and ‘pathway’) to −∞ when computing log-probabilities for the target token ‘in’. It is worth noting that trie-based search is also applied during the validation phase of the fine-tuning stage for acceleration (approximately 16× speed-up in our experiments).
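
A minimal sketch of the trie-constrained logit masking in panel (d) is shown below; the toy trie, vocabulary and tensor shapes are illustrative assumptions, not the model’s actual tokenizer or decoding code.

```python
import math
import torch

# Toy candidate-answer trie: only paths in the trie are valid continuations.
trie = {"Lipid": {"breakdown": {"in": {}}, "metabolism": {}}}

def allowed_next(trie: dict, prefix: list) -> set:
    """Return the set of tokens that validly extend the decoded prefix."""
    node = trie
    for token in prefix:
        node = node[token]
    return set(node.keys())

def mask_logits(logits: torch.Tensor, vocab: list, allowed: set) -> torch.Tensor:
    """Set logits of invalid tokens to -inf before log-probabilities are computed."""
    masked = logits.clone()
    for i, token in enumerate(vocab):
        if token not in allowed:
            masked[i] = -math.inf
    return masked

vocab = ["in", "mechanism", "pathway", "metabolism"]
logits = torch.randn(len(vocab))
masked = mask_logits(logits, vocab, allowed_next(trie, ["Lipid", "breakdown"]))
print(torch.log_softmax(masked, dim=-1))  # only 'in' keeps a finite log-probability
```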

Extended Data Fig. 4 Comparative performance of BiomedGPT and Med-PaLM M, and prompt-tuning results in image classification.

(a) Comparison between BiomedGPT-B and Med-PaLM M on the CBIS-DDSM dataset. (b) Experimental results of prompt tuning BiomedGPT-B on three image classification datasets. Prompt tuning learns ‘soft prompts’, that is, extra model parameters for each task, instead of making a task-specific copy of the entire pretrained model for each downstream task, and inference must be performed in separate batches. We must mention that the addition of soft prompts is contrary to the design principle of a generalist model. We injected two prompt layers into the encoder and decoder and varied the prompt length over {20, 40, 60, 80, 100, 120} to compare performance against full-model fine-tuning. The preliminary results for ‘Colon pathology’, ‘Blood cell microscope’ and ‘Chest X-ray’ were obtained after 100, 512 and 55 training epochs, respectively, all with a consistent batch size of 512. We observed that as the prompt length increases, model performance tends to improve. However, despite an increased number of tuning epochs compared with fine-tuning the original BiomedGPT (Fig. 3c), performance after prompt tuning notably lags behind that of full-model fine-tuning. Specifically, considering only the best results in prompt tuning, there are substantial accuracy reductions of 32.3%, 54.6% and 32.6% on these three datasets, respectively.
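
A minimal sketch of the soft-prompt idea is shown below: learnable prompt vectors are prepended to the input embeddings while the backbone would remain frozen; the class name, sizes and initialization are illustrative assumptions, not the implementation used in these experiments.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the input embeddings."""

    def __init__(self, prompt_length: int, hidden_size: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = input_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompts, input_embeds], dim=1)

# Prompt lengths swept in the experiment: 20, 40, 60, 80, 100, 120.
soft_prompt = SoftPrompt(prompt_length=60, hidden_size=768)
augmented = soft_prompt(torch.randn(4, 32, 768))  # -> shape (4, 60 + 32, 768)
```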

Extended Data Fig. 5 Additional zero-shot results of BiomedGPT.

(a) Graphical illustration of zero-shot classification using CLIP-style models, linear-probing transfer learning using ViT or BERT-style models, and zero-shot generation with BiomedGPT. Notably, our model can generate the response without additional components such as the label candidates required by CLIP or the trained linear classifier required by ViT. (b) Zero-shot performance on five disease diagnosis tasks. (c) BiomedGPT shows competitive zero-shot performance compared with Med-PaLM M at a much smaller model scale. The SOTA fine-tuned model for TB detection is TBLightNet. Note that no single model consistently outperforms the others across all four metrics used in report generation; here, SOTAs represent the best performance achieved in each specific metric. We fine-tuned our pretrained BiomedGPT-B on MultiMedBench, which Med-PaLM M proposed and used for fine-tuning based on the pretrained PaLM-E. We also attempted to fine-tune LLaVA-Med; however, the time and computational costs were prohibitive owing to the large scale of the model and data. Therefore, we reported the results using the pretrained checkpoint of LLaVA-Med.
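
The contrast in panel (a) can be sketched as follows: CLIP-style zero-shot classification scores an image embedding against embeddings of supplied label candidates, whereas BiomedGPT decodes the answer text directly. The random embeddings and the commented `generate` call below are placeholders, not real model outputs or APIs.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for encoder outputs (not real model outputs).
image_emb = F.normalize(torch.randn(512), dim=0)
label_candidates = ["pneumonia", "no finding"]          # must be supplied for CLIP-style models
text_embs = F.normalize(torch.randn(len(label_candidates), 512), dim=1)

# CLIP-style zero-shot classification: argmax similarity over the candidate labels.
clip_prediction = label_candidates[int((text_embs @ image_emb).argmax())]
print("CLIP-style prediction:", clip_prediction)

# BiomedGPT-style zero-shot generation: the answer is decoded directly from the prompt,
# with no candidate list or separately trained linear classifier, e.g.
# answer = model.generate(image, "what disease does the image show?")  # hypothetical call
```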

Supplementary information

Supplementary Information

Supplementary Figs. 1–9 and Supplementary Tables 1–7.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhang, K., Zhou, R., Adhikarla, E. et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat Med (2024). https://doi.org/10.1038/s41591-024-03185-2
