Update README.md
Commit aaf2e06 (1 parent: fc68459), committed by 1-800-BAD-CODE
README.md CHANGED
@@ -61,6 +61,86 @@ language:
 This is an `xlm-roberta` fine-tuned to restore punctuation, true-case (capitalize),
 and detect sentence boundaries (full stops) in 47 languages.
 
+# Usage
+
+The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
+
+```bash
+$ pip install punctuators
+```
+
+Though this is just an ONNX and SentencePiece model, so you may run it as you wish.
+
+<details open>
+
+<summary>Example Usage</summary>
+
+```python
+
+from typing import List
+
+from punctuators.models import PunctCapSegModelONNX
+
+m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained(
+    "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"
+)
+
+input_texts: List[str] = [
+    # "hello world how's it going did you see the game last night my favorite team was playing and i got to go to "
+    # "the game it went into overtime and i got home late i like most sports but some are kind of boring especially "
+    # "baseball most of the time they aren't really playing they're just standing around waiting for something to "
+    # "happen i wish it were more exiting like football or hockey in those sports you have practically non stop play "
+    # "and everyone is involved in the game at all times unlike in baseball where it's only one person at a time",
+    # "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas "
+    # "de la ciudad",
+    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
+    # "未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭",
+    # "በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ "
+    # "በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል",
+    # "all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and "
+    # "should act towards one another in a spirit of brotherhood",
+    # "सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए",
+    # "wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw są oni obdarzeni rozumem i "
+    # "sumieniem i powinni postępować wobec innych w duchu braterstwa",
+    # "tous les êtres humains naissent libres et égaux en dignité et en droits ils sont doués de raison et de conscience "
+    # "et doivent agir les uns envers les autres dans un esprit de fraternité",
+]
+input_texts: List[str] = [
+    "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad",
+    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
+    "未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭",
+    "በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል",
+    "all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood",
+    "सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए",
+    "wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw są oni obdarzeni rozumem i sumieniem i powinni postępować wobec innych w duchu braterstwa",
+    "tous les êtres humains naissent libres et égaux en dignité et en droits ils sont doués de raison et de conscience et doivent agir les uns envers les autres dans un esprit de fraternité",
+]
+
+results: List[List[str]] = m.infer(
+    texts=input_texts, apply_sbd=True,
+)
+for input_text, output_texts in zip(input_texts, results):
+    print(f"Input: {input_text}")
+    print(f"Outputs:")
+    for text in output_texts:
+        print(f"\t{text}")
+    print()
+
+```
+
+</details>
+
+
+<details open>
+
+<summary>Expected output</summary>
+
+```text
+
+```
+
+</details>
+
 # Model Architecture
 This model implements the following graph, which allows punctuation, true-casing, and fullstop prediction
 in every language without language-specific behavior: