Difficult prompt you may want to add to training data
Thanks for an impressive sql coder!
Although it came closer than many other sql coder models I have seen on huggingface, it still did not quite nail the prompt below, which GPT-3.5 seems to nail.
Feel free to add the prompt to your training data if interested in improving the model further.
Question:
"Among the artists having concerts in year 2020, which artist has a song whose title equals the name of the stadium on which the concert takes place?"
Database Schema:
"""
CREATE TABLE stadium (
stadium_id number,
location text,
name text,
capacity number,
highest number,
lowest number,
average number
)
CREATE TABLE singer (
singer_id number,
name text,
country text,
song_name text,
song_release_year text,
age number,
is_male others
)
CREATE TABLE concert (
concert_id number,
concert_name text,
theme text,
stadium_id text,
year text
)
CREATE TABLE singer_in_concert (
concert_id number,
singer_id text
)
"""
Invalid response, as retreived at https://defog.ai/sqlcoder-demo/:
"""
SELECT singer.name
FROM singer_in_concert join singer on singer_in_concert.singer_id = singer.singer_id join concert on singer_in_concert.concert_id = concert.concert_id
WHERE concert.year = '2020'
and singer.song_name = stadium.name
GROUP BY singer.name;
"""
Valid(?) response, as given by GPT-3.5:
"""
SELECT s.name AS artist_name
FROM singer s
JOIN singer_in_concert sic ON s.singer_id = sic.singer_id
JOIN concert c ON sic.concert_id = c.concert_id
JOIN stadium st ON c.stadium_id = st.stadium_id
WHERE c.year = '2020' AND s.song_name = st.name;
"""
HI there, thank you for testing out the model – and for opening an issue! We will definitely incorporate queries like this (without column descriptions of join hints) in our training data in the future. Currently, our model performs best when column descriptions and join hints are defined.
This is because – unlike gpt-3.5-turbo – we have not yet done instruction fine-tuning with RLHF or Reward Modeling. We will do that soon.
With the metadata defined like below, the model does give the right answer to this question. But in the future, we hope to start returning high quality without any metadata definitions.
Query generated with additional metadata defined
SELECT singer.name,
singer.song_name
FROM singer_in_concert join singer on singer_in_concert.singer_id = singer.singer_id join concert on singer_in_concert.concert_id = concert.concert_id join stadium on concert.stadium_id = stadium.stadium_id
WHERE concert.year = '2020'
and singer.song_name = stadium.name
ORDER BY singer.name;
Additional metadata
CREATE TABLE stadium (
stadium_id number, --unique id for each stadium
location text, --city and state
name text, --name of the stadium
capacity number, --maximum number of people that can be seated
highest number, --highest number of people that have attended a concert
lowest number, --lowest number of people that have attended a concert
average number --average number of people that have attended a concert
)
CREATE TABLE singer (
singer_id number, --unique id for each singer
name text, --name of the singer
country text, --country of origin of the singer
song_name text, --name of the song
song_release_year text, --year the song was released
age number, --age of the singer
is_male bool --whether or not the singer is male
)
CREATE TABLE concert (
concert_id number, --unique id for each concert
concert_name text, --name of the concert
theme text, --theme of the concert
stadium_id text, --id of the stadium where the concert takes place
year text --year the concert takes place
)
CREATE TABLE singer_in_concert (
concert_id number, --id of the concert
singer_id text --id of the singer
)
-- stadium.stadium_id can be joined with concert.stadium_id
-- singer.singer_id can be joined with singer_in_concert.singer_id
-- concert.concert_id can be joined with singer_in_concert.concert_id