My termites win
- Aug 6, 2007
I am familiar with the transformer model and the "Attention Is All You Need" paper. That's a very general maxim about avoiding overfitting, though; you'd need to explain the actual reason why this model isn't overfitting. At base, whatever your network is, no matter what kind, you are fitting a multidimensional model, and if your weights/parameters are under-determined (or even perfectly determined) you will be responding to noise. One test would be a cross-validation check: randomly split the data into training and test sets in several different ways (restarting from your baseline each time) and measure the Bayesian Information Criterion, the Akaike Information Criterion, or some similar metric.

I think Transformers built on neural networks may operate differently than you are thinking. We used a T5-small model, which provides eight-headed attention across the encoder and decoder, resulting in approximately 60 million parameters.
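Those numbers are easy to confirm from the published checkpoint. A minimal sketch, assuming the Hugging Face `transformers` library and the public `t5-small` weights:

```python
# Minimal check of the head and parameter counts for the public "t5-small"
# checkpoint (assumes the Hugging Face `transformers` library is installed).
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

print(model.config.num_heads)                      # 8 attention heads per layer
print(sum(p.numel() for p in model.parameters()))  # roughly 60 million parameters
```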
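Coming back to the random-split check with an information criterion suggested in the first comment, a generic sketch of what that could look like follows; `fit_from_baseline` and `log_likelihood` are hypothetical placeholders for whatever training and scoring routines the model actually uses:

```python
# Generic sketch of repeated random train/test splits scored with AIC/BIC.
# `fit_from_baseline` and `log_likelihood` are hypothetical stand-ins for the
# model's real training and scoring code.
import math
import random

def aic(k, log_lik):
    # Akaike Information Criterion: 2k - 2 ln(L)
    return 2 * k - 2 * log_lik

def bic(k, n, log_lik):
    # Bayesian Information Criterion: k ln(n) - 2 ln(L)
    return k * math.log(n) - 2 * log_lik

def random_split_check(examples, num_params, fit_from_baseline, log_likelihood,
                       splits=5, test_fraction=0.2):
    results = []
    for _ in range(splits):
        shuffled = examples[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit_from_baseline(train)        # restart from the baseline each split
        train_ll = log_likelihood(model, train)
        results.append({
            "aic": aic(num_params, train_ll),
            "bic": bic(num_params, len(train), train_ll),
            "held_out_ll": log_likelihood(model, test),
        })
    return results
```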
If your model uses pretrained embeddings and transfer learning from a previously trained model, the need for the parameter count to be small compared to the number of data points is no longer an issue (because the pre-training was done on ridiculous amounts of data to start with).
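To make that concrete: fine-tuning in this setting only nudges weights that were already fit during pre-training, so the small task dataset never has to determine all ~60 million parameters on its own. A rough sketch of such a fine-tuning loop, assuming the Hugging Face `transformers` library and placeholder data (not the actual training code used here):

```python
# Rough sketch of fine-tuning a pre-trained t5-small model on a small task
# dataset (placeholder examples; assumes `transformers`, `torch`, and `sentencepiece`).
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")  # weights already learned on the pre-training corpus

# Placeholder task examples; in practice this is the small labelled set.
pairs = [("translate English to German: Hello, world.", "Hallo, Welt.")]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for source, target in pairs:
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy over the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```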
Edit: Found out your model was indeed pre-trained.