# Classification Challenge using CatBoost

## INF2179 Fall 2021
### Hamid Yuksel

This submission uses [CatBoost](https://catboost.ai/).
CatBoost was chosen mainly because it needs relatively little hyperparameter tuning and handles categorical and text features without manual preprocessing. It is also fast and fairly easy to set up.

<img src="https://cdn.britannica.com/39/7139-050-A88818BB/Himalayan-chocolate-point.jpg"
     alt="Himalayan chocolate-point cat"
     style="float: left; margin-right: 10px;" />


In [1]:
# Install and import the required libraries
! pip3 install --user catboost
! pip3 install --user ipywidgets
! jupyter nbextension enable --py widgetsnbextension

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from catboost import Pool, CatBoostClassifier

You should consider upgrading via the '/Users/yuksel/opt/anaconda3/bin/python3 -m pip install --upgrade pip' command.
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [2]:
# Reading data
df = pd.read_csv('data.csv')

# Split: the first 50,000 rows are labeled; the last 5,000 form the testing set
training = df.head(50000)
holdout_set = training.sample(5000, random_state=1)  # pick 5,000 labeled rows at random
training = training.drop(holdout_set.index)  # remove the holdout rows from the training data
testing = df.tail(5000)

# Look at the counts per genre
df['Genre'].value_counts()

Rock       24486
Pop        16251
Hip Hop     9263
unknown     5000
Name: Genre, dtype: int64
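
The genre counts above are imbalanced (Rock makes up roughly half of the labeled rows), so a purely random holdout can drift slightly from the overall class mix. As a hedged sketch, and not the method used in this notebook, a stratified holdout can be drawn with pandas `groupby(...).sample(...)` (available in pandas 1.1 and later). The toy frame and the helper name `stratified_holdout` are illustrative assumptions.

```python
import pandas as pd


def stratified_holdout(frame, label_col, frac, seed=1):
    """Return (train, holdout), where holdout takes `frac` of every class."""
    holdout = frame.groupby(label_col).sample(frac=frac, random_state=seed)
    train = frame.drop(holdout.index)  # keep only rows not drawn for the holdout
    return train, holdout


# Toy stand-in for the labeled rows (hypothetical data, not data.csv)
toy = pd.DataFrame({
    "Lyric": [f"line {i}" for i in range(100)],
    "Genre": ["Rock"] * 50 + ["Pop"] * 30 + ["Hip Hop"] * 20,
})

toy_train, toy_holdout = stratified_holdout(toy, "Genre", frac=0.1)
holdout_counts = toy_holdout["Genre"].value_counts().to_dict()
print(holdout_counts)
```

Each genre contributes the same fraction of its rows to the holdout, so the holdout keeps the class proportions of the full labeled set.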

In [3]:
# Split the training and testing sets into features (X) and labels (y)
train_y = training.Genre
train_X = training.drop('Genre', axis=1)

test_X = testing.drop('Genre', axis=1)

test_X

| Unnamed: 0 | Lyric |
|---|---|
| 50000 | Feels so good,. Feels so good,. Feels so good ... |
| 50001 | Shadow of a doubt. I heard your heart,. you he... |
| 50002 | Slaves. Hebrews born to serve to the pharaoh. ... |
| 50003 | You've been picked and it's over. What's the c... |
| 50004 | Magic happens. But only if you are open to the... |
| ... | ... |
| 54995 | I can't believe what you did to me. Down on my... |
| 54996 | Have all the songs been written?. Have all the... |
| 54997 | Everything you do you do so right. The clothes... |
| 54998 | (trecho). (Rule Number Two. Understanding what... |


In [5]:
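# Build a classifier
text_features = ['Lyric']  # columns CatBoost should treat as raw text

# Wrap features and labels in a Pool so CatBoost knows which columns are text
train_dataset = Pool(data=train_X,
                     label=train_y,
                     text_features=text_features)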

model = CatBoostClassifier(iterations=100,
                           learning_rate=1,
                           depth=5,
                           loss_function='MultiClass')

model.fit(train_dataset)

0:	learn: 0.8093706	total: 204ms	remaining: 20.2s
1:	learn: 0.7672922	total: 383ms	remaining: 18.8s
2:	learn: 0.7547202	total: 538ms	remaining: 17.4s
3:	learn: 0.7451655	total: 725ms	remaining: 17.4s
4:	learn: 0.7425807	total: 874ms	remaining: 16.6s
5:	learn: 0.7348963	total: 1.03s	remaining: 16.1s
6:	learn: 0.7305562	total: 1.18s	remaining: 15.7s
7:	learn: 0.7265356	total: 1.35s	remaining: 15.5s
8:	learn: 0.7236361	total: 1.51s	remaining: 15.2s
9:	learn: 0.7214462	total: 1.66s	remaining: 14.9s
10:	learn: 0.7199267	total: 1.8s	remaining: 14.6s
11:	learn: 0.7176381	total: 1.95s	remaining: 14.3s
12:	learn: 0.7126308	total: 2.13s	remaining: 14.3s
13:	learn: 0.7106341	total: 2.27s	remaining: 14s
14:	learn: 0.7080899	total: 2.42s	remaining: 13.7s
15:	learn: 0.7062654	total: 2.57s	remaining: 13.5s
16:	learn: 0.7047084	total: 2.71s	remaining: 13.2s
17:	learn: 0.7034535	total: 2.85s	remaining: 13s
18:	learn: 0.6994373	total: 3.01s	remaining: 12.8s
19:	learn: 0.6973022	total: 3.17s	remaining: 1

<catboost.core.CatBoostClassifier at 0x7fb6f45e2c70>

In [6]:
# Estimate accuracy
pred = model.predict(holdout_set.drop('Genre',axis=1))
estimated_accuracy = accuracy_score(holdout_set['Genre'], pred)
print(estimated_accuracy)
pd.Series(estimated_accuracy).to_csv('ea.csv', index=False, header=False)

0.6796
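
To put the 0.6796 holdout accuracy in context, it helps to compare it with a majority-class baseline. The sketch below derives that baseline from the genre counts printed earlier in this notebook. It assumes the holdout sample has roughly the same class mix as the 50,000 labeled rows, which a random sample only approximates.

```python
# Genre counts for the labeled rows, copied from the value_counts() output
label_counts = {"Rock": 24486, "Pop": 16251, "Hip Hop": 9263}
holdout_accuracy = 0.6796  # value printed by the accuracy cell

total_labeled = sum(label_counts.values())
majority_genre = max(label_counts, key=label_counts.get)

# Accuracy obtained by always predicting the most frequent genre
baseline_accuracy = label_counts[majority_genre] / total_labeled

print(f"majority class: {majority_genre}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"improvement over baseline: {holdout_accuracy - baseline_accuracy:.4f}")
```

Under that assumption, the model beats always guessing Rock by about 19 percentage points.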


In [7]:
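# Predict the testing set and write one label per line to pred.csv
pred = model.predict(test_X)
print(pred.flatten())
# to_csv returns None when given a path, so its result is not assigned back to pred
pd.Series(pred.flatten()).to_csv('pred.csv', index=False, header=False)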

['Pop' 'Rock' 'Rock' ... 'Rock' 'Pop' 'Rock']


In [8]:
# Count the number of predictions per genre
np.unique(model.predict(test_X), return_counts=True)

(array(['Hip Hop', 'Pop', 'Rock'], dtype=object), array([ 821, 1385, 2794]))
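
A quick sanity check is to compare the predicted genre shares with the shares in the labeled data. This sketch uses only the counts printed in this notebook. Whether the unlabeled rows follow the same genre distribution as the labeled ones is an assumption.

```python
# Counts copied from the notebook outputs
labeled_counts = {"Hip Hop": 9263, "Pop": 16251, "Rock": 24486}
pred_counts = {"Hip Hop": 821, "Pop": 1385, "Rock": 2794}


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


labeled_share = shares(labeled_counts)
pred_share = shares(pred_counts)

for genre in labeled_counts:
    diff = pred_share[genre] - labeled_share[genre]
    print(f"{genre:8s} labeled={labeled_share[genre]:.3f} "
          f"predicted={pred_share[genre]:.3f} diff={diff:+.3f}")
```

The predictions lean toward Rock (about 56% of predictions against about 49% of labeled rows), at the expense of Pop and Hip Hop.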

In [None]:
import pickle

# Save the fitted model; the context manager closes the file handle
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)
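
To reuse the saved model later, the pickle file can be read back. The sketch below shows the round trip with context managers so the file handles are closed promptly. A plain dictionary stands in for the fitted classifier, and the file name `model_demo.pickle` is a hypothetical placeholder written to a temporary directory.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted classifier (assumption: any picklable object works the same way)
stand_in_model = {"name": "demo", "classes": ["Hip Hop", "Pop", "Rock"]}

path = os.path.join(tempfile.mkdtemp(), "model_demo.pickle")

# Write the object to disk
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back into a new object
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

With the real model, the restored object could then be used for `predict` calls in the same way as the original.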