cyk1337 committed
Commit
1e9a811
1 Parent(s): d2195a8

Update README.md

Files changed (1)
  1. README.md +55 -22
README.md CHANGED
@@ -15,26 +15,9 @@ ERNIE-Code is a unified large language model (LLM) that connects 116 natural lan
 [ACL 2023 (Findings)](https://aclanthology.org/2023.findings-acl.676/) | [arXiv](https://arxiv.org/pdf/2212.06742)
 
 
-### Multilingual Text-to-Code / Code-to-Text
+### Usage
 
-First preprocess the input prompt:
-```python
-def clean_up_code_spaces(s: str):
-    # post process
-    # ===========================
-    new_tokens = ["<pad>", "</s>", "<unk>", "\n", "\t", "<|space|>"*4, "<|space|>"*2, "<|space|>"]
-    for tok in new_tokens:
-        s = s.replace(f"{tok} ", tok)
-
-    cleaned_tokens = ["<pad>", "</s>", "<unk>"]
-    for tok in cleaned_tokens:
-        s = s.replace(tok, "")
-    s = s.replace("<|space|>", " ")
-    # ===========================
-    return s
-```
 
-Then use `transformers` to load the model:
 ```python
 import torch
 from transformers import (
@@ -49,22 +32,72 @@ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 # note that you can use the aforementioned `clean_up_code_spaces` to preprocess the code
 
-input_code="快速排序"  # "quick sort" in Chinese
-prompt="translate Chinese to English: \n%s" % (input_code)  # your prompt here
+
+def format_code_with_spm_compatablity(line: str):
+    format_dict = {
+        " ": "<|space|>"
+    }
+    tokens = list(line)
+    i = 0
+    while i < len(tokens):
+        if line[i] == "\n":
+            while i + 1 < len(tokens) and tokens[i + 1] == " ":
+                tokens[i + 1] = format_dict.get(" ")
+                i += 1
+        i += 1
+    formatted_line = ''.join(tokens)
+    return formatted_line
+
+
+TYPE = "code"  # define input type in ("code", "text")
+input = "arr.sort()"
+prompt = "translate python to java: \n%s" % (input)  # your prompt here
+
+TYPE = "text"  # define input type in ("code", "text")
+input = "quick sort"
+prompt = "translate English to Japanese: \n%s" % (input)  # your prompt here
+
+assert TYPE in ("code", "text")
+
+# preprocess for code input
+if TYPE == "code":
+    prompt = format_code_with_spm_compatablity(prompt)
 
 model_inputs = tokenizer(prompt, max_length=512, padding=False, truncation=True, return_tensors="pt")
+
+model = model.cuda()  # by default
 input_ids = model_inputs.input_ids.cuda()  # by default
 attention_mask = model_inputs.attention_mask.cuda()  # by default
 
 output = model.generate(input_ids=input_ids, attention_mask=attention_mask,
-                        num_beams=5, max_length=512)  # change to your own decoding methods
+                        num_beams=5, max_length=20)  # adjust to your needs
 
-# Ensure to customize the post-processing of clean_up_code_spaces output according to specific requirements.
+# customize the post-processing of `clean_up_code_spaces` output to your requirements
 output = tokenizer.decode(output.flatten(), skip_special_tokens=True)
+
+
+# post-process the code generation
+def clean_up_code_spaces(s: str):
+    # post process
+    # ===========================
+    new_tokens = ["<pad>", "</s>", "<unk>", "\n", "\t", "<|space|>"*4, "<|space|>"*2, "<|space|>"]
+    for tok in new_tokens:
+        s = s.replace(f"{tok} ", tok)
+
+    cleaned_tokens = ["<pad>", "</s>", "<unk>"]
+    for tok in cleaned_tokens:
+        s = s.replace(tok, "")
+    s = s.replace("<|space|>", " ")
+    return s
+
+output = clean_up_code_spaces(output)  # `tokenizer.decode` returns a single string, not a list
 ```
 
+You can adapt the [seq2seq translation code](https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation) for finetuning.
+
 You can also check the official inference code on [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-code/README.en.md).
 
+
 ### Zero-shot Examples
 - Multilingual code-to-text generation (zero-shot)
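
For reference, the two helpers introduced in this commit are approximately inverse over code indentation: `format_code_with_spm_compatablity` rewrites the spaces that follow each newline as `<|space|>` sentinels before tokenization, and `clean_up_code_spaces` strips special tokens and maps the sentinels back to spaces after decoding. A minimal, model-free round-trip check (the helper bodies are adapted from the updated README; the sample `code` string is illustrative, not part of the commit):

```python
# Round-trip check for the whitespace helpers from the updated README.
# The sample `code` string below is illustrative, not from the commit.

def format_code_with_spm_compatablity(line: str):
    # encode indentation (spaces after a newline) as <|space|> sentinels
    format_dict = {" ": "<|space|>"}
    tokens = list(line)
    i = 0
    while i < len(tokens):
        if line[i] == "\n":
            while i + 1 < len(tokens) and tokens[i + 1] == " ":
                tokens[i + 1] = format_dict.get(" ")
                i += 1
        i += 1
    return ''.join(tokens)

def clean_up_code_spaces(s: str):
    # drop special tokens and turn <|space|> sentinels back into spaces
    new_tokens = ["<pad>", "</s>", "<unk>", "\n", "\t", "<|space|>"*4, "<|space|>"*2, "<|space|>"]
    for tok in new_tokens:
        s = s.replace(f"{tok} ", tok)
    for tok in ["<pad>", "</s>", "<unk>"]:
        s = s.replace(tok, "")
    return s.replace("<|space|>", " ")

code = "def f(x):\n    return x\n"
encoded = format_code_with_spm_compatablity(code)
assert "<|space|>" * 4 in encoded             # indentation became sentinel tokens
assert clean_up_code_spaces(encoded) == code  # cleanup restores the original
```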
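
If you generate for several prompts at once, `tokenizer.batch_decode` returns a list of strings, so per-prediction cleanup with a list comprehension applies naturally. A sketch, assuming the same `model`, `tokenizer`, and `clean_up_code_spaces` as in the README snippet above (the prompts are illustrative; drop the `.cuda()` calls on CPU):

```python
# Batch inference sketch; assumes `model`, `tokenizer`, and
# `clean_up_code_spaces` are already defined as in the README above.
prompts = [
    "translate English to Japanese: \nquick sort",  # illustrative text prompt
    "translate python to java: \narr.sort()",       # illustrative code prompt
]
batch = tokenizer(prompts, max_length=512, padding=True,
                  truncation=True, return_tensors="pt")
outputs = model.generate(input_ids=batch.input_ids.cuda(),
                         attention_mask=batch.attention_mask.cuda(),
                         num_beams=5, max_length=64)
preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
preds = [clean_up_code_spaces(p) for p in preds]  # list in, list out
```

For code inputs, apply `format_code_with_spm_compatablity` to each prompt before tokenization, as in the single-example snippet.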