update with IRR and new images

Files changed:
- examples.py +12 -0
- introduction.md +36 -3
- static/img/examples/vestito1.png (added)
- static/img/examples/vestito_autunnale.png (added)
examples.py CHANGED
@@ -36,3 +36,15 @@ def app():
     st.subheader("una coppia che passeggia sulla spiaggia al tramonto")
     st.markdown("*a couple walking on the beach at sunset*")
     st.image("static/img/examples/couple_3.jpeg")
+
+    st.markdown("### 2. Dresses")
+
+    col1, col2 = st.beta_columns(2)
+    col1.subheader("un vestito primaverile")
+    col1.markdown("*a dress for the spring*")
+    col1.image("static/img/examples/vestito1.png")
+
+    col2.subheader("un vestito autunnale")
+    col2.markdown("*a dress for the autumn*")
+    col2.image("static/img/examples/vestito_autunnale.png")
+
introduction.md CHANGED
@@ -25,6 +25,9 @@ have the highest similarity with the text query.
 + *Image to Text*: This task is essentially a zero-shot image classification task. The user is asked for an image and for a set of captions/labels, and CLIP
 is going to compute the similarity between the image and each label. The webapp is going to display a probability distribution over the captions.
 
++ *Examples and Applications*: This page showcases some of the most interesting results we got from the model; we believe that
+several different applications could start from here.
+
 # Novel Contributions
 
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
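To make the *Image to Text* description above concrete, here is a minimal sketch of zero-shot classification with a CLIP-style dual encoder via `transformers`. The checkpoint name, image path and captions are placeholders, and the Space may load the Italian model through different classes; this only illustrates the similarity-plus-softmax computation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint: swap in the CLIP-Italian weights in practice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("static/img/examples/couple_3.jpeg")
captions = [
    "una coppia che passeggia sulla spiaggia al tramonto",  # a couple walking on the beach at sunset
    "un vestito autunnale",                                 # a dress for the autumn
    "una partita di calcio",                                # a football match (hypothetical caption)
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the image-caption similarities; softmax turns them
# into the probability distribution the webapp displays.
probs = out.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.2%}  {caption}")
```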
@@ -34,6 +37,11 @@ To get competitive results we followed three strategies:
 2. better augmentations;
 3. better training.
 
+For those interested, we have a [Comet report](https://www.comet.ml/g8a9/clip-italian/reports/clip-italian-training-metrics)
+that shows a **subset** of the experiments we ran. Different hyper-parameters played a role in reducing the validation
+loss: the optimizer gave us great performance and fast convergence, more data and augmentations helped a lot with generalization,
+and the work on the training procedure and on the loss gave us the final increase that you can see in the results.
+
 ## More and Better Data
 
 We eventually had to deal with the fact that we do not have the same data that OpenAI had during the training of CLIP.
@@ -65,6 +73,22 @@ a dataset with 700K translated captions.
 + [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/). This image-caption dataset is collected from [Il Post](https://www.ilpost.it/), a prominent Italian online newspaper. The collection contains almost 30K pairs: starting from early 2011, for each day, editors at Il Post pick several images picturing the most salient events in the world. Each photo comes along with an Italian caption.
 
 
+### A Note on Translations
+
+Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** was the main
+reason for this choice: with far fewer images than OpenAI, we cannot risk polluting our own data. CC is a great resource,
+but its captions have to be handled accordingly. We translated 700K captions and evaluated their quality.
+
+Two of us looked at a sample of 100 translations and rated them with scores from 1 to 4:
+1: the sentence has lost its meaning, or it is not possible to understand it; 2: it is possible to get the idea,
+but something is wrong; 3: good, although a native speaker might complain about some choices; 4: good translation.
+
+The average score was 3.8, and the two annotators had an inter-rater agreement of 0.86, computed with
+[Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal weighting (great agreement!).
+
+We know that we annotated our own data; in the spirit of fairness, we also share the annotations and the captions so
+that those interested can check their quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
+
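To make the agreement number above concrete, here is a minimal sketch of Gwet's coefficient for two raters. It shows the unweighted AC1 form; the score reported above uses the ordinal-weighted variant, which keeps the same (pa - pe) / (1 - pe) structure but gives partial credit to near-miss scores. The toy ratings are hypothetical.

```python
from collections import Counter

def gwet_ac1(rater_a, rater_b, categories=(1, 2, 3, 4)):
    """Unweighted Gwet's AC1 agreement coefficient for two raters."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    q = len(categories)

    # Observed agreement: fraction of items with identical scores.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Average proportion of ratings falling into each category.
    counts = Counter(rater_a) + Counter(rater_b)
    pi = {c: counts[c] / (2 * n) for c in categories}

    # Chance agreement as defined by Gwet (2008).
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)

    return (pa - pe) / (1 - pe)

# Toy example with hypothetical scores on the 1-4 scale described above.
scores_a = [4, 4, 3, 4, 2, 4, 3, 4, 4, 1]
scores_b = [4, 4, 3, 3, 2, 4, 4, 4, 4, 1]
print(f"AC1 = {gwet_ac1(scores_a, scores_b):.2f}")
```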
 ## Better Augmentations
 
 We knew that without a good augmentation strategy we could never get results competitive with a model trained on 400 million images. Therefore, we implemented heavy augmentations to make the training more data-efficient. They include random affine transformations and perspective changes, as well as occasional equalization and random changes to brightness, contrast, saturation and hue. We made sure to keep hue augmentations limited, however, to still give the model the ability to learn color definitions.
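As a rough illustration of the augmentation recipe described above, an equivalent pipeline with standard torchvision transforms might look like the sketch below. Every magnitude here is an illustrative guess rather than the value used in training; the only choice mirrored deliberately from the text is the small hue jitter.

```python
import torchvision.transforms as T

# Illustrative approximation of the augmentations listed above;
# parameter values are guesses, not the training configuration.
train_transform = T.Compose([
    T.Resize((224, 224)),
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),  # random affine transformations
    T.RandomPerspective(distortion_scale=0.2, p=0.3),                    # perspective changes
    T.RandomEqualize(p=0.2),                                             # occasional equalization
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4,
                  hue=0.05),  # hue kept small so the model can still learn colors
    T.ToTensor(),
])
```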
@@ -116,6 +140,14 @@ We selected two different tasks:
 + image-retrieval
 + zero-shot classification
 
+### Reproducibility
+
+Both experiments should be very easy to replicate; we share the two Colab notebooks we used to compute the results:
+
++ [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
++ [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
+
+
 ### Image Retrieval
 
 This experiment is run against the MSCOCO-IT validation set (that we haven't used in training). Given in input
@@ -131,7 +163,6 @@ we use the MRR.
 It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
 on 400 million images (and some of them were probably from MSCOCO).
 
-You can find the colab to quickly rerun the experiments here: [Colab](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
 
 ### Zero-shot image classification
 
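For readers unfamiliar with the metric named in the hunk above, the Mean Reciprocal Rank (MRR) averages the reciprocal of the rank at which the correct item is retrieved for each query. A minimal sketch, with hypothetical rankings standing in for CLIP-Italian similarity search over the MSCOCO-IT validation images, is shown below.

```python
def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """MRR over a set of queries.

    ranked_ids[i]  : ids returned for query i, best match first.
    relevant_ids[i]: id of the item that query i actually refers to.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if relevant in ranking:
            total += 1.0 / (ranking.index(relevant) + 1)  # reciprocal of the 1-based rank
    return total / len(relevant_ids)

# Toy example: three captions whose gold image is ranked 1st, 3rd and 2nd.
print(mean_reciprocal_rank([[7, 2, 9], [4, 1, 7], [5, 8, 3]], [7, 7, 8]))  # (1 + 1/3 + 1/2) / 3
```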
@@ -146,13 +177,13 @@ To do this, we used DeepL to translate the image labels in ImageNet. We evaluate
 | Accuracy@10  | **52.55** | 42.91 |
 | Accuracy@100 | **81.08** | 67.11 |
 
-You can find the colab to quickly rerun the experiments here: [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
-
 Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
 we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
 paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). However, considering that our results are in line with those obtained by mCLIP, we think that
 the translated image labels might have had an impact on the final scores.
 
+
+
 ## Qualitative Evaluation
 
 We hereby show some very interesting properties of the model. One is its ability to detect colors,
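The Accuracy@k figures in the table above count how often the correct (translated) ImageNet label falls among the k labels most similar to the image. A minimal sketch, assuming the image-label similarity matrix has already been computed, could look like this; the random data is only there to make the snippet runnable.

```python
import numpy as np

def accuracy_at_k(similarities, true_label_ids, k):
    """similarities: (num_images, num_labels) image-label similarity matrix.
    true_label_ids: (num_images,) index of the correct label for each image.
    Returns the fraction of images whose correct label is in the top-k."""
    # Indices of the k most similar labels for every image.
    topk = np.argsort(-similarities, axis=1)[:, :k]
    hits = (topk == np.asarray(true_label_ids)[:, None]).any(axis=1)
    return hits.mean()

# Toy usage with random scores for 1000 "images" and 1000 translated labels.
rng = np.random.default_rng(0)
sims = rng.standard_normal((1000, 1000))
labels = rng.integers(0, 1000, size=1000)
for k in (1, 5, 10, 100):
    print(f"Accuracy@{k}: {accuracy_at_k(sims, labels, k):.2%}")
```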
@@ -199,6 +230,8 @@ Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). [Conceptual capti
 
 Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning](https://arxiv.org/pdf/2103.01913.pdf). arXiv preprint arXiv:2103.01913.
 
+Gwet, K. L. (2008). [Computing inter-rater reliability and its variance in the presence of high agreement.](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) British Journal of Mathematical and Statistical Psychology, 61(1), 29-48.
+
 Reimers, N., & Gurevych, I. (2020, November). [Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation.](https://aclanthology.org/2020.emnlp-main.365/) In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
 
 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). [Learning Transferable Visual Models From Natural Language Supervision.](https://arxiv.org/abs/2103.00020) ICML.
static/img/examples/vestito1.png ADDED
static/img/examples/vestito_autunnale.png ADDED