Standard GPT-2 architecture with the config below, trained on a subset of OpenWebText:
n_ctx = 256, n_positions = 256, n_layer = 6, n_embd = 384, n_head = 6
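
To make the size of this configuration concrete, here is a minimal sketch in plain Python that records the listed values and derives the per-head dimension and an approximate parameter count. The class name `SmallGPT2Config` and the helper `count_parameters` are hypothetical and exist only for illustration. The vocabulary size (50257, the standard GPT-2 BPE vocabulary), the 4x MLP expansion, and the tied input/output embeddings are assumptions based on the standard GPT-2 layout; none of them is stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmallGPT2Config:
    """Hypothetical container for the hyperparameters listed above."""
    n_ctx: int = 256
    n_positions: int = 256
    n_layer: int = 6
    n_embd: int = 384
    n_head: int = 6
    # Assumption: standard GPT-2 BPE vocabulary; not stated in the config above.
    vocab_size: int = 50257

    @property
    def head_dim(self) -> int:
        # The embedding width must split evenly across attention heads.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head


def count_parameters(cfg: SmallGPT2Config) -> int:
    """Estimate trainable parameters, assuming the standard GPT-2 block layout
    (4x MLP expansion, biases everywhere, output head tied to token embeddings)."""
    d = cfg.n_embd
    token_emb = cfg.vocab_size * d       # wte
    pos_emb = cfg.n_positions * d        # wpe
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # up projection + down projection
    layer_norms = 2 * (2 * d)            # two LayerNorms per block (weight + bias)
    per_block = attn + mlp + layer_norms
    final_ln = 2 * d
    return token_emb + pos_emb + cfg.n_layer * per_block + final_ln


cfg = SmallGPT2Config()
print("head_dim:", cfg.head_dim)
print("approx. parameters:", count_parameters(cfg))
```

Under these assumptions most of the parameters sit in the token embedding table rather than in the six transformer blocks, which is typical for a model this narrow paired with a full-size vocabulary.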