Commit 3ca8c75 by vinid (1 parent: 1a22608)

update with IRR and new images

examples.py CHANGED
@@ -36,3 +36,15 @@ def app():
  st.subheader("una coppia che passeggia sulla spiaggia al tramonto")
  st.markdown("*a couple walking on the beach at sunset*")
  st.image("static/img/examples/couple_3.jpeg")
+
+ st.markdown("### 2. Dresses")
+
+ col1, col2 = st.beta_columns(2)
+ col1.subheader("un vestito primavrile")
+ col1.markdown("*a dress for the spring*")
+ col1.image("static/img/examples/vestito1.png")
+
+ col2.subheader("un vestito autunnale")
+ col2.markdown("*a dress for the autumn*")
+ col2.image("static/img/examples/vestito_autunnale.png")
+
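A side note on the snippet added above: `st.beta_columns` was Streamlit's pre-1.0 layout API. On newer Streamlit releases the `beta_` prefix has been dropped in favour of `st.columns`; a minimal sketch of the equivalent layout (not part of this commit):

```python
import streamlit as st

# Same two-column layout with the current Streamlit API.
col1, col2 = st.columns(2)
col1.subheader("un vestito primaverile")  # "a dress for the spring"
col2.subheader("un vestito autunnale")    # "a dress for the autumn"
```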
introduction.md CHANGED
@@ -25,6 +25,9 @@ have the highest similarity with the text query.
  + *Image to Text*: This task is essentially a zero-shot image classification task. The user is asked for an image and for a set of captions/labels, and CLIP
  is going to compute the similarity between the image and each label. The webapp is going to display a probability distribution over the captions.
 
+ + *Examples and Applications*: This page showcases some interesting results we got from the model; we believe that
+ different applications can start from here.
+
  # Novel Contributions
 
  The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
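The *Image to Text* task described in the hunk above maps onto the standard CLIP zero-shot recipe. As a rough sketch using the Hugging Face `transformers` CLIP API (the checkpoint, image path and labels below are placeholders, not the webapp's actual code):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; the webapp would load its own Italian CLIP weights.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("static/img/examples/couple_3.jpeg")
labels = ["una coppia sulla spiaggia", "un vestito autunnale", "un gatto"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Image-text similarity scores, turned into the probability distribution
# over captions that the webapp displays.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```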
@@ -34,6 +37,11 @@ To get competitive results we followed three strategies:
  2. better augmentations;
  3. better training.
 
+ For those interested, we have a [Comet report](https://www.comet.ml/g8a9/clip-italian/reports/clip-italian-training-metrics)
+ that shows a **subset** of the experiments we ran. Different hyper-parameters played a role in reducing the validation
+ loss: the optimizer gave us good performance and fast convergence, more data and better augmentations helped a lot with
+ generalization, and work on the training procedure and on the loss gave us the final improvement you can see in the results.
+
  ## More and Better Data
 
  We eventually had to deal with the fact that we do not have the same data that OpenAI had during the training of CLIP.
@@ -65,6 +73,22 @@ a dataset with 700K translated captions.
  + [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/). This image-caption dataset is collected from [Il Post](https://www.ilpost.it/), a prominent Italian online newspaper. The collection contains almost 30K pairs: starting from early 2011, for each day, editors at Il Post pick several images picturing the most salient events in the world. Each photo comes along with an Italian caption.
 
 
+ ### A Note on Translations
+
+ Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
+ reason for this choice. With the few images we have (compared to OpenAI), we cannot risk polluting our own data. CC is a great resource,
+ but its captions have to be handled accordingly. We translated 700K captions and evaluated their quality:
+
+ Two of us looked at a sample of 100 of the translations and rated them on a scale from 1 to 4.
+ 1: the sentence has lost its meaning, or it is not possible to understand it; 2: it is possible to get the idea,
+ but there is something wrong; 3: good, although a native speaker might complain about some of the phrasing; 4: good translation.
+
+ The average score was 3.8, and the two annotators had an inter-rater agreement of 0.86, computed with
+ [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal weighting (great agreement!).
+
+ We know that we annotated our own data; in the spirit of fairness, we also share the annotations and the captions so
+ that those interested can check their quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
+
  ## Better Augmentations
 
  We knew that without a good augmentation strategy we could never get results competitive with a model trained on 400 million images. Therefore, we implemented heavy augmentations to make the training more data-efficient. They include random affine transformations and perspective changes, as well as occasional equalization and random changes to brightness, contrast, saturation and hue. We made sure to keep hue augmentations limited, however, to still give the model the ability to learn color definitions.
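The augmentation recipe described at the end of this hunk can be sketched with torchvision; the parameter values below are illustrative guesses, not the ones used in training:

```python
from torchvision import transforms

# Illustrative magnitudes only; the training code may use different values.
train_augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.3),
    transforms.RandomEqualize(p=0.1),  # occasional equalization (torchvision >= 0.9)
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3,
                           hue=0.02),  # hue kept small so color definitions stay learnable
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
```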
@@ -116,6 +140,14 @@ We selected two different tasks:
  + image-retrieval
  + zero-shot classification
 
+ ### Reproducibility
+
+ Both experiments should be very easy to replicate; we share the two Colab notebooks we used to compute the results:
+
+ + [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ + [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
+
+
  ### Image Retrieval
 
  This experiment is run against the MSCOCO-IT validation set (that we haven't used in training). Given in input
@@ -131,7 +163,6 @@ we use the MRR.
  It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
  on 400 million images (and some of them were probably from MSCOCO).
 
- You can find the colab to quickly rerun the experiments here: [Colab](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
 
  ### Zero-shot image classification
 
@@ -146,13 +177,13 @@ To do this, we used DeepL to translate the image labels in ImageNet. We evaluate
  | Accuracy@10 | **52.55** | 42.91 |
  | Accuracy@100 | **81.08** | 67.11 |
 
- You can find the colab to quickly rerun the experiments here: [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
-
  Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
  we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
  paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). Still, considering that our results are in line with those obtained by mCLIP, we think that
  the translated image labels might have had an impact on the final scores.
 
+
+
  ## Qualitative Evaluation
 
  We hereby show some very interesting properties of the model. One is its ability to detect colors,
@@ -199,6 +230,8 @@ Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). [Conceptual capti
 
  Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning.](https://arxiv.org/pdf/2103.01913.pdf) arXiv preprint arXiv:2103.01913.
 
+ Gwet, K. L. (2008). [Computing inter-rater reliability and its variance in the presence of high agreement.](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) British Journal of Mathematical and Statistical Psychology, 61(1), 29-48.
+
  Reimers, N., & Gurevych, I. (2020, November). [Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation.](https://aclanthology.org/2020.emnlp-main.365/) In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
 
  Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). [Learning Transferable Visual Models From Natural Language Supervision.](https://arxiv.org/abs/2103.00020) ICML.
 
static/img/examples/vestito1.png ADDED
static/img/examples/vestito_autunnale.png ADDED