Model Card for: mmx_classifier_microblog_ENv02

Multi-label classifier that identifies which marketing mix variable(s) a microblog post pertains to.

Version: 0.2 from August 16, 2023

Model Details

You can use this classifier to determine which of the 4 Ps of marketing, also known as marketing mix variables, a microblog post (e.g., a Tweet) pertains to:

Product
Place
Price
Promotion

Model Description

This classifier is a fine-tuned checkpoint of [cardiffnlp/twitter-roberta-large-2022-154m](https://huggingface.co/cardiffnlp/twitter-roberta-large-2022-154m). It was trained on 15K Tweets that mentioned at least one of 699 brands. The Tweets were first cleaned and then labeled using OpenAI's GPT-4.

Because this is a multi-label classification problem, we fine-tune with binary cross-entropy (BCE) with logits loss, which combines a sigmoid layer and BCELoss in a single class. The model therefore outputs raw logits. To obtain the probability for each label (i.e., marketing mix variable), you need to pass the logits through a sigmoid function. The accompanying Python notebook already does this.
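To make the arithmetic concrete, here is a minimal sketch in plain Python (no PyTorch required) of what BCE with logits computes. The logits and targets are made-up values, and the function names are illustrative rather than part of this model's code. It suggests two things: applying a sigmoid and then BCE gives the same loss as the fused, numerically stable form, and the sigmoid outputs are independent per-label probabilities that need not sum to 1.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probs(probs, targets):
    """Mean binary cross-entropy computed from probabilities (two-step form)."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)

def bce_with_logits(logits, targets):
    """Mean BCE computed directly from logits (fused, numerically stable form)."""
    # Per label: max(z, 0) - z * t + log(1 + exp(-|z|))
    losses = [max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
              for z, t in zip(logits, targets)]
    return sum(losses) / len(losses)

# Hypothetical logits for four labels and made-up 0/1 targets
logits = [2.0, -1.5, 0.3, -3.0]
targets = [1.0, 0.0, 1.0, 0.0]

probs = [sigmoid(z) for z in logits]
loss_two_step = bce_from_probs(probs, targets)
loss_fused = bce_with_logits(logits, targets)

print("per-label probabilities:", [round(p, 3) for p in probs])
print("two-step loss:", loss_two_step, "fused loss:", loss_fused)
```

Because each label gets its own sigmoid, a post can exceed the threshold for several labels at once, or for none.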

IMPORTANT: At the time of writing this description, Hugging Face's pipeline did not support multi-label classifiers.

Working Paper

Download the working paper from SSRN: "Creating Synthetic Experts with Generative AI" (https://ssrn.com/abstract=4542949)

Quickstart

# Imports
import pandas as pd, numpy as np, warnings, torch, re
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from bs4 import BeautifulSoup
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
# Helper Functions
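def clean_and_parse_tweet(tweet):
    # Replace links with a placeholder token
    tweet = re.sub(r"https?://\S+|www\.\S+", " URL ", tweet)
    # Strip HTML; skip tweets whose markup mentions "filename"
    soup = BeautifulSoup(tweet, "html.parser")
    if "filename" in str(soup):
        return None
    parsed = soup.get_text()
    if not parsed:
        return None
    # Replace newlines (literal "\n" and real ones), drop leading '.' or ':', squeeze spaces
    parsed = re.sub(r"\\n+|\n+", " ", parsed)
    parsed = re.sub(r"^[.:]+", "", parsed).strip()
    return re.sub(r" +", " ", parsed)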
def predict_tweet(tweet, model, tokenizer, device, threshold=0.5):
    inputs = tokenizer(tweet, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    probs = torch.sigmoid(model(**inputs).logits).detach().cpu().numpy()[0]
    return probs, [id2label[i] for i, p in enumerate(probs) if id2label[i] in {'Product', 'Place', 'Price', 'Promotion'} and p >= threshold]
# Setup
device = "mps" if torch.backends.mps.is_built() and torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
synxp = "dmr76/mmx_classifier_microblog_ENv02"
model = AutoModelForSequenceClassification.from_pretrained(synxp).to(device)
tokenizer = AutoTokenizer.from_pretrained(synxp)
id2label = model.config.id2label
# ---->>> Define your Tweet  <<<----
tweet = "Best cushioning ever!!! 🤗🤗🤗  my zoom vomeros are the bomb🏃🏽‍♀️💨!!!  \n @nike #run #training https://randomurl.ai"
# Clean and Predict
cleaned_tweet = clean_and_parse_tweet(tweet)
probs, labels = predict_tweet(cleaned_tweet, model, tokenizer, device)
# Print Labels and Probabilities
print("Please don't forget to cite the paper: https://ssrn.com/abstract=4542949 if you use this code")
print(labels, probs)

Conveniently predict thousands of tweets with the batch-processing Python notebook, available in my GitHub repository.
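For readers without the notebook at hand, the batching idea can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual implementation: `chunked`, `predict_in_batches`, and `predict_fn` are hypothetical names, the label order shown is assumed, and `fake_predict_fn` merely stands in for the tokenizer, model, and sigmoid step so the sketch runs without downloading the model.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_in_batches(tweets, predict_fn, labels, batch_size=32, threshold=0.5):
    """Return one list of predicted labels per tweet.

    predict_fn (hypothetical) takes a list of strings and returns one list of
    per-label probabilities per string, in the same order as `labels`.
    Tweets that are None (e.g., rejected during cleaning) yield None.
    """
    results = [None] * len(tweets)
    # Keep original positions so skipped tweets retain their slot in the output
    valid = [(i, t) for i, t in enumerate(tweets) if t is not None]
    for batch in chunked(valid, batch_size):
        probs = predict_fn([text for _, text in batch])
        for (i, _), row in zip(batch, probs):
            results[i] = [lab for lab, p in zip(labels, row) if p >= threshold]
    return results

def fake_predict_fn(texts):
    """Stand-in returning made-up probabilities; NOT the real model."""
    return [[0.9, 0.1, 0.6 if "price" in text else 0.2, 0.1] for text in texts]

# Assumed label order, for illustration only
labels = ["Product", "Place", "Price", "Promotion"]
tweets = ["great shoes", None, "great price on shoes"]

predictions = predict_in_batches(tweets, fake_predict_fn, labels, batch_size=2)
print(predictions)
```

With the real model, `predict_fn` would tokenize the batch with padding and truncation, run the model, and apply a sigmoid to the logits, as `predict_tweet` does in the Quickstart for a single tweet. The label order should then be read from `model.config.id2label` rather than hard-coded.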

Citation

Please cite the following reference if you use synthetic experts in your work:

Ringel, Daniel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://ssrn.com/abstract=4542949