Commit c246f00 by roygan (parent: 7d01a88): Update README.md
Files changed (1): README.md (+54, -0)

---
language:
- en
license: apache-2.0

tags:
- bart
- biobart
- biomedical

inference: true

widget:
- text: "Influenza is a <mask> disease."
- type: "text-generation"

---
# Yuyuan-Bart-400M, one model of [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)
Yuyuan-Bart-400M is a biomedical generative language model jointly developed by Tsinghua University and the International Digital Economy Academy (IDEA).

Paper: [BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model](https://arxiv.org/pdf/2204.03905.pdf)

## Pretraining Corpora
We use PubMed abstracts as the pretraining corpus, which contains about 41 GB of biomedical research paper abstracts.

## Pretraining Setup
We continuously pretrain the large version of BART for 120k steps with a batch size of 2560, using the same vocabulary as BART to tokenize the texts. Although BART accepts inputs of up to 1024 tokens, the tokenized PubMed abstracts rarely exceed 512 tokens, so for training efficiency we truncate all inputs to a maximum length of 512. We mask 30% of the input tokens, with masked span lengths sampled from a Poisson distribution (λ = 3), as in BART. We use a learning rate schedule with a 0.02 warm-up ratio followed by linear decay, and the learning rate is set to 1e-4. We train the large version of BioBART (400M parameters) on 2 DGX nodes with 16 40GB A100 GPUs for about 168 hours, using the open-source framework DeepSpeed.

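As a rough illustration of the text-infilling noising described above, the sketch below masks about 30% of the tokens of a sequence, drawing span lengths from a Poisson distribution with λ = 3 and collapsing each masked span into a single `<mask>` token. This is a minimal sketch of the idea, not the authors' actual preprocessing code; the function name `text_infilling` and the span-selection rule are our own illustrative choices.

```python
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.0, rng=None):
    """Sketch of BART-style text infilling: mask roughly `mask_ratio` of the tokens,
    with span lengths drawn from Poisson(`poisson_lambda`); each span becomes one mask token."""
    rng = rng or np.random.default_rng(0)
    budget = int(round(len(tokens) * mask_ratio))  # total number of tokens to mask
    out, i = [], 0
    while i < len(tokens):
        span = int(min(rng.poisson(poisson_lambda), budget))
        if span > 0 and rng.random() < mask_ratio:
            out.append(mask_token)  # the whole span is replaced by a single mask token
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("influenza is an infectious disease caused by influenza viruses".split()))
```
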
## Usage
```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the tokenizer and the pretrained model from the Hugging Face Hub
tokenizer = BartTokenizer.from_pretrained('IDEA-CCNL/Yuyuan-Bart-400M')
model = BartForConditionalGeneration.from_pretrained('IDEA-CCNL/Yuyuan-Bart-400M')

# Ask the model to fill in the masked span
text = 'Influenza is a <mask> disease.'
input_ids = tokenizer([text], return_tensors="pt")['input_ids']
model.eval()
generated_ids = model.generate(input_ids=input_ids)

# Decode the generated token ids back into text
preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
print(preds)
```
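
The example above relies on `generate`'s default decoding settings. Continuing from that snippet, a variant with explicit decoding parameters is sketched below; the particular values (beam count, output length) are illustrative assumptions on our part, not recommendations from the model card.

```python
# Illustrative decoding settings (assumed values, not specified by the authors)
generated_ids = model.generate(
    input_ids=input_ids,
    num_beams=5,        # beam search with an explicit beam count
    max_length=64,      # cap the length of the generated sequence
    early_stopping=True,
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```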

## Citation
If you find this resource useful, please cite the following paper.
```
@misc{BioBART,
  title={BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model},
  author={Hongyi Yuan and Zheng Yuan and Ruyi Gan and Jiaxing Zhang and Yutao Xie and Sheng Yu},
  year={2022},
  eprint={2204.03905},
  archivePrefix={arXiv}
}
```