A generalist vision–language foundation model for diverse biomedical tasks (2024)

Data availability

All data in this study are publicly available and can be accessed from: IU X-ray and Peir Gross (https://github.com/nlpaueb/bioCaption), MedICat (https://github.com/allenai/medicat), PathVQA (https://huggingface.co/datasets/flaviagiammarino/path-vqa), SLAKE 1.0 (https://www.med-vqa.com/slake/), DeepLesion (https://nihcc.app.box.com/v/DeepLesion), OIA-DDR (https://github.com/nkicsl/OIA), CheXpert-v1.0-small (https://www.kaggle.com/datasets/willarevalo/chexpert-v10-small), CytoImageNet (https://www.kaggle.com/datasets/stanleyhua/cytoimagenet), ISIC 2020 (https://challenge2020.isic-archive.com), Retinal Fundus (https://www.kaggle.com/c/diabetic-retinopathy-detection), MIMIC-III Clinic Notes (https://paperswithcode.com/dataset/hospital-admission-notes-from-mimic-iii), NCBI BioNLP (https://www.ncbi.nlm.nih.gov/research/bionlp/Data/), PubMed abstracts derived from the BLUE benchmark (https://github.com/ncbi-nlp/BLUE_Benchmark), VQA-RAD (https://osf.io/89kps/), CBIS-DDSM (https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset), SZ-CXR and MC-CXR (access can be requested via the contact at http://archive.nlm.nih.gov/repos/chestImages.php), MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.1.0/), MedNLI (https://physionet.org/content/mednli/1.0.0/), TREC 2022 (https://www.trec-cds.org/2022.html), SEER (https://seer.cancer.gov), MIMIC-III (https://physionet.org/content/mimiciii/1.4/), HealthcareMagic (https://huggingface.co/datasets/UCSD26/medical_dialog), MeQSum (https://huggingface.co/datasets/sumedh/MeQSum), MedMNIST v2 (https://medmnist.com) and ROCO (https://github.com/razorx89/roco-dataset). A randomly sampled subset of the RSNA Pneumonia Detection Challenge (2018) was used for zero-shot prediction (https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018).
MedMNIST-Raw is curated from multiple sources, including NCT-CRC-HE-100K (colon pathology) (https://zenodo.org/records/1214456), HAM10000 (dermoscopy) (https://github.com/ptschandl/HAM10000_dataset), OCT and Chest X-ray (https://data.mendeley.com/datasets/rscbjbr9sj/3), breast ultrasound (https://scholar.cu.edu.eg/Dataset_BUSI.zip), blood cell microscopy (https://data.mendeley.com/datasets/snkd93bnjr/1) and the Liver Tumor Segmentation Benchmark (LiTS) (https://competitions.codalab.org/competitions/17094). The VQA data for human evaluation are derived from Medical-Diff-VQA (https://physionet.org/content/medical-diff-vqa/1.0.0/), excluding questions about differences between images, as these require a two-image input. Report generation and summarization samples for human evaluation are extracted from MIMIC-CXR. The instruction-following data used in this article are derived from PubMed (https://pubmed.ncbi.nlm.nih.gov) following the LLaVA-Med approach (https://github.com/microsoft/LLaVA-Med/blob/main/download_data.sh) and are combined with the training sets from PathVQA and SLAKE. Further details of the major datasets are provided in Extended Data Table 2.

Code availability

The pretrained and fine-tuned models, as well as source code for training, inference and data preprocessing, can be accessed at https://github.com/taokz/BiomedGPT.




Acknowledgements

NSF grant CRII-2246067, NSF POSE: Phase II-2346158 and Lehigh Grant FRGS00011497 supported L.S., K.Z., Z.Y. and Y.L. NIH grant R21EY034179, NSF grants NCS-2319451, MRI-2215789 and IIS-1909879, as well as Lehigh’s Accelerator and CORE grants S00010293 and S001250, supported L.H. and R.Z. NIH grants R01HL159183 and RF1AG057892 supported Q.L. NIH grant R03AG078625 supported X.L. NIH grants R01EB19403 and R01LM11934 supported S.F. and H.L. Icons used in Fig. 2 were made by Freepike, surang, Smartline and Blackonion02 at www.flaticon.com.

Author information

Authors and Affiliations

  1. Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA

    Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Brian D. Davison, Lifang He & Lichao Sun

  2. School of Computing, University of Georgia, Athens, GA, USA

    Zhengliang Liu & Tianming Liu

  3. Samsung Research America, Mountain View, CA, USA

    Xun Chen

  4. Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA

    Hui Ren, Xiang Li & Quanzheng Li

  5. Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA

    Jing Huang & Yong Chen

  6. PolicyLab, Children’s Hospital of Philadelphia, Philadelphia, PA, USA

    Jing Huang

  7. Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA

    Chen Chen

  8. Department of Computer Science and Engineering, University of California, Santa Cruz, CA, USA

    Yuyin Zhou

  9. McWilliams School of Biomedical Informatics, UTHealth, Houston, TX, USA

    Sunyang Fu & Hongfang Liu

  10. Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, USA

    Wei Liu

  11. The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania, Philadelphia, PA, USA

    Yong Chen

  12. Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA

    Yong Chen

  13. Leonard Davis Institute of Health Economics, Philadelphia, PA, USA

    Yong Chen

  14. Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA

    James Zou

  15. Department of Computer Science, Stanford University, Stanford, CA, USA

    James Zou


Contributions

K.Z. and L.S. designed the study. K.Z., R.Z. and E.A. carried out data collection, data preprocessing, model construction and model validation. J.Y., Z.Y., Y.L. and Z.L. carried out the data analysis and benchmarking of results. X.C., B.D.D., J.H., C.C., Y.Z., S.F., W.L., T.L., X.L., Y.C., L.H., J.Z., Q.L. and H.L. provided knowledge support and interpreted the findings. H.R. carried out the human evaluation of the text generated by BiomedGPT and GPT-4V. L.S. provided knowledge support, interpreted the findings and supervised the study. All authors contributed to manuscript writing and reviewed and approved the final version. L.H., X.L. and L.S. co-supervised the study.

Corresponding authors

Correspondence to Xiang Li, Lifang He or Lichao Sun.

Ethics declarations

Competing interests

The research was conducted independently of any commercial or financial relationships that could be construed as a potential conflict of interest. Although X.C. is employed by Samsung, the company was not involved in any aspect of this research. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Statistics of pretraining and fine-tuning datasets.

(a) Modality distribution of pretraining data used in BiomedGPT. (b) For the training and testing splits of datasets used in downstream fine-tuning, we typically follow the format of number of training samples/number of validation samples/number of test samples to detail each dataset. More details of the data split are described in Supplementary Table 7.

Extended Data Fig. 2 Overview of BiomedGPT’s model configuration and architecture.

(a) Detailed model configuration of BiomedGPT. Here, ‘#’ denotes ‘number of’. ‘Att.’, ‘Enc.’ and ‘Dec.’ indicate Attention, Encoder and Decoder, respectively. The hidden size is the size of the embeddings and of the output of each self-attention and feed-forward layer. The first layer of the FFN expands the hidden size to the intermediate size, and the second layer contracts it back to the hidden size. This expansion and contraction allow the network to create more complex representations. During the pretraining phase, image processing involves resizing and cropping the images to varying resolutions, corresponding to the input sizes listed in the table. During the fine-tuning and inference stages, the input resolution of BiomedGPT can be flexibly adjusted according to the specific requirements of the task. (b) The neural network architecture of BiomedGPT, which includes bidirectional encoder blocks and autoregressive decoder blocks. The number of blocks varies for different model scales.
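
The expansion and contraction of the FFN described in (a) can be sketched as a two-layer feed-forward block applied to a single token embedding. This is only an illustration: the sizes (4 and 16), the GELU activation, the random weights and all function names are assumptions chosen for the example, not BiomedGPT's actual configuration or code.

```python
import math
import random


def gelu(x):
    # Exact GELU; the choice of activation is an assumption for this sketch.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim, out_dim, rng):
    # weight is stored as [in_dim][out_dim]; bias has length out_dim.
    weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(vec, weight, bias):
    # Plain matrix-vector product plus bias.
    return [
        sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
        for j in range(len(bias))
    ]


def ffn(vec, w1, b1, w2, b2):
    expanded = [gelu(h) for h in linear(vec, w1, b1)]  # hidden -> intermediate
    return linear(expanded, w2, b2)                    # intermediate -> hidden


HIDDEN_SIZE = 4         # hypothetical, far smaller than any real model
INTERMEDIATE_SIZE = 16  # hypothetical expansion size

rng = random.Random(0)
w1, b1 = init_linear(HIDDEN_SIZE, INTERMEDIATE_SIZE, rng)
w2, b2 = init_linear(INTERMEDIATE_SIZE, HIDDEN_SIZE, rng)

token = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]
intermediate = linear(token, w1, b1)
output = ffn(token, w1, b1, w2, b2)

# Dimensions go hidden -> intermediate -> hidden.
print(len(token), len(intermediate), len(output))
```

The output vector has the same length as the input, which is what lets such a block be stacked with residual connections inside each encoder or decoder layer.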

Extended Data Fig. 3 The graphical illustrations of the key components in BiomedGPT.

(a) Head-scale multi-head attention module in BiomedGPT. The trainable parameter \(\gamma_{h}\) is applied prior to the output projection for each head. (b) Instead of adding the absolute positional embedding \(P_{i}\) to the input embedding \(I_{i}\) (left), we compute the positional correlation and input correlation separately with different projection matrices and add them together in the self-attention module (right). (c) Graphical illustration of relative position bias. Such an inductive bias \(B_{j-i}\) is a learnable parameter and can be viewed as the embedding of the relative position \(j-i\), which is injected into the Query-Key product: \(\frac{1}{\sqrt{d}}({I}_{i}{W}^{\,Q})({P}_{i}{W}^{\,K})+{B}_{j-i}\), and shared across all layers. (d) An example of trie-based beam search: along the path across ‘Lipid’ and ‘breakdown’, BiomedGPT sets logits for all invalid tokens (‘mechanism’ and ‘pathway’) to −∞ when computing log-probabilities for the target token ‘in’. Trie-based search is also applied during the validation phase of the fine-tuning stage for acceleration (an approximately 16× increase in speed in our experiments).
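
The per-step masking in (d) can be sketched as follows. The sketch covers only the logit masking for one decoding step, not a full beam search. The candidate sequences, the logit values and all function names are hypothetical; only the tokens ‘Lipid’, ‘breakdown’, ‘in’, ‘mechanism’ and ‘pathway’ come from the caption.

```python
import math


def build_trie(sequences):
    # Nested dicts: each key is a token, each value the sub-trie that follows it.
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie, prefix):
    # Walk the trie along the decoded prefix and return the valid next tokens.
    node = trie
    for tok in prefix:
        node = node[tok]
    return set(node)


def masked_log_probs(logits, allowed):
    # Set logits of invalid tokens to -inf, then apply a stable log-softmax.
    masked = {t: (v if t in allowed else -math.inf) for t, v in logits.items()}
    top = max(masked.values())
    if top == -math.inf:
        raise ValueError("no valid token at this step")
    log_z = top + math.log(sum(math.exp(v - top) for v in masked.values()))
    return {t: v - log_z for t, v in masked.items()}


# Hypothetical candidate outputs; a real trie would hold the task's label set.
candidates = [
    ["Lipid", "breakdown", "in", "macrophages"],
    ["Lipid", "synthesis", "pathway"],
    ["Protein", "degradation", "mechanism"],
]
trie = build_trie(candidates)

prefix = ["Lipid", "breakdown"]
logits = {"in": 1.0, "mechanism": 2.0, "pathway": 0.5}  # made-up model scores

allowed = allowed_next(trie, prefix)
log_probs = masked_log_probs(logits, allowed)
print(allowed, log_probs)
```

Although ‘mechanism’ has the highest raw logit in this made-up example, it cannot follow ‘Lipid breakdown’ in the trie, so all probability mass moves to ‘in’. Pruning invalid continuations in this way is also consistent with the speed-up the caption reports for validation.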

Extended Data Fig. 4 Comparative performance of BiomedGPT and Med-PaLM M and the prompt tuning results in image classification.

(a) Comparison between BiomedGPT-B and Med-PaLM M on the CBIS-DDSM dataset. (b) The experimental results of prompt tuning BiomedGPT-B on three image classification datasets. Prompt tuning learns ‘soft prompts’, or extra model parameters, for each task instead of making a task-specific copy of the entire pretrained model for each downstream task, which would require inference to be performed in separate batches. Note that adding soft prompts runs contrary to the design principle of a generalist model. We injected two prompt layers into the encoder and decoder, and varied the prompt length over {20, 40, 60, 80, 100, 120} to compare performance against full-model fine-tuning. The preliminary results for ‘Colon pathology’, ‘Blood cell microscope’ and ‘Chest X-ray’ were obtained after 100, 512 and 55 training epochs, respectively, all with a consistent batch size of 512. We observed that model performance tends to improve as the prompt length increases. However, despite the larger number of tuning epochs compared with fine-tuning the original BiomedGPT (Fig. 3c), performance after prompt tuning notably lags behind that of full-model fine-tuning. Specifically, considering only the best prompt-tuning results, there are substantial accuracy reductions of 32.3%, 54.6% and 32.6% on these three datasets, respectively.
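
The general idea of a soft prompt can be sketched as a small trainable matrix that is prepended to the frozen model's input embeddings. This is a generic illustration, not the authors' implementation: the hidden size, the input length, the initialization and the function names are all assumptions, and only the prompt lengths are taken from the caption.

```python
import random

PROMPT_LENGTHS = [20, 40, 60, 80, 100, 120]  # lengths listed in the caption
HIDDEN_SIZE = 8                               # hypothetical, kept tiny for the sketch
INPUT_LENGTH = 5                              # hypothetical number of input tokens


def init_soft_prompt(length, hidden, rng):
    # The only trainable parameters: one vector per prompt position.
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden)] for _ in range(length)]


def prepend_prompt(prompt, embeddings):
    # The pretrained model stays frozen; it simply sees a longer input sequence.
    return prompt + embeddings


rng = random.Random(0)
frozen_input = [
    [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in range(INPUT_LENGTH)
]

summary = []
for length in PROMPT_LENGTHS:
    prompt = init_soft_prompt(length, HIDDEN_SIZE, rng)
    extended = prepend_prompt(prompt, frozen_input)
    trainable = length * HIDDEN_SIZE  # parameters per prompt layer in this sketch
    summary.append((length, len(extended), trainable))

for row in summary:
    print(row)
```

Because only the prompt vectors are updated, the number of trained parameters grows linearly with the prompt length and stays small relative to the full model.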

Extended Data Fig. 5 Additional zero-shot results of BiomedGPT.

(a) Graphical illustration of zero-shot classification using CLIP-style models, linear probing transfer learning using ViT or BERT-style models, and zero-shot generation of BiomedGPT. Notably, our model can generate the response without additional components, such as the label candidates required by CLIP or the trained linear classifier required by ViT. (b) Zero-shot performance on five disease diagnosis tasks. (c) BiomedGPT shows competitive zero-shot performance compared with Med-PaLM M at a much smaller model scale. The SOTA fine-tuned model for TB detection is TBLightNet. Note that no single model consistently outperforms the others across all four metrics used in report generation. Here, SOTAs represent the best performance achieved in each specific metric. We fine-tuned our pretrained BiomedGPT-B on MultiMedBench, which Med-PaLM M proposed and used for fine-tuning based on the pretrained PaLM-E. We also attempted to fine-tune LLaVA-Med; however, the time and computational costs were prohibitive owing to the large scale of the model and data. Therefore, we report the results obtained with the pretrained checkpoint of LLaVA-Med.
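
To make the contrast in (a) concrete, the following is a minimal sketch of CLIP-style zero-shot classification, which needs an embedding for every label candidate before it can score an image. The embeddings are hand-written toy vectors and the label names and function names are hypothetical; a real system would obtain the embeddings from trained image and text encoders.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    # Score the image against every candidate label and pick the best match.
    scores = {
        label: cosine(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy text embeddings for two hypothetical label candidates.
label_embeddings = {
    "pneumonia": [0.9, 0.1, 0.0],
    "normal": [0.1, 0.9, 0.0],
}
image_embedding = [0.8, 0.3, 0.1]  # toy image embedding

prediction, scores = zero_shot_classify(image_embedding, label_embeddings)
print(prediction, scores)
```

The candidate list is an explicit input here, which is the additional component the caption refers to; a generative model instead produces the answer text directly from the image and the question.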


Supplementary information

Supplementary Information

Supplementary Figs. 1–9 and Supplementary Tables 1–7.

