Tips for Pretraining BERT from Scratch

Take this with a grain of salt, but I heard that BERT-large (roughly 340M parameters) can’t be trained without a TPU because the model and its training state are too large to fit into GPU memory.
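To put a rough number on that claim, here is a back-of-the-envelope sketch of how much memory the parameters and optimizer state alone would need. It assumes roughly 340M parameters (the figure reported in the BERT paper), fp32 values, and an Adam-style optimizer that keeps two extra values per parameter. The function name is made up for illustration, and activation memory is deliberately left out.

```python
# Back-of-the-envelope estimate, NOT a measurement.
# Assumptions: ~340M parameters, fp32 (4 bytes per value),
# Adam-style optimizer with two moment buffers per parameter.
# Activation memory is not counted here.

BERT_LARGE_PARAMS = 340_000_000  # approximate count, assumed
BYTES_PER_FP32 = 4
GIB = 2 ** 30


def training_state_gib(num_params, bytes_per_value=BYTES_PER_FP32, optimizer_slots=2):
    """Memory for weights + gradients + optimizer slots, in GiB."""
    copies = 1 + 1 + optimizer_slots  # weights, gradients, optimizer buffers
    return num_params * bytes_per_value * copies / GIB


weights_gib = BERT_LARGE_PARAMS * BYTES_PER_FP32 / GIB
state_gib = training_state_gib(BERT_LARGE_PARAMS)

print(f"weights only:             {weights_gib:.2f} GiB")
print(f"weights + grads + Adam:   {state_gib:.2f} GiB")
```

Under these assumptions the parameters and optimizer state come to only a few GiB, so the parameter count by itself probably isn't what fills a GPU. The remaining memory goes to activations, which grow with batch size and sequence length, so how tight things get depends heavily on those settings.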