T5 Finetuning Tips

Sure thing @valhalla. I haven’t tried many settings, but LR 0.001 seems to work just fine for smaller finetuning batches. I’m running a global batch of 2*8 (2 per GPU), with a bit of gradient accumulation on top (4x, I believe), but tbh it’s not really that sensitive as far as I can tell. The only gotcha is to turn off the extra scaling options that fairseq’s Adafactor threw in there and sets to True by default for no good reason: scale_parameter=False, relative_step=False. You need relative_step=False anyway if you want to pass your own LR.
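For anyone who wants to copy the settings, here is a rough sketch of what that looks like in code. It assumes the Adafactor port in `transformers.optimization`; the helper names (`adafactor_kwargs`, `effective_batch_size`, `build_optimizer`) are made up for illustration, and `warmup_init=False` is my assumption since it only makes sense together with `relative_step=True`. The batch arithmetic assumes the 4x accumulation sits on top of the 2*8 global batch.

```python
def adafactor_kwargs(lr=1e-3):
    """Adafactor settings from the post: fixed LR, extra scaling turned off."""
    return dict(
        lr=lr,                  # manual learning rate (0.001 in the post)
        scale_parameter=False,  # do not scale the LR by the RMS of each parameter
        relative_step=False,    # do not derive the LR from the step count
        warmup_init=False,      # assumption: only meaningful with relative_step=True
    )


def effective_batch_size(per_gpu, n_gpus, accum_steps):
    """Examples contributing to one optimizer step."""
    return per_gpu * n_gpus * accum_steps


def build_optimizer(model):
    """Hypothetical helper: build Adafactor for any torch model."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers.optimization import Adafactor

    return Adafactor(model.parameters(), **adafactor_kwargs())


# 2 per GPU, 8 GPUs, 4 accumulation steps
print(effective_batch_size(per_gpu=2, n_gpus=8, accum_steps=4))
```

Usage would be something like `optimizer = build_optimizer(model)` before the training loop, with the optimizer stepped once every 4 backward passes.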

To get bigger batches, I’m pretty sure we need to add some gradient checkpointing to the model. Trying that out next…
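A minimal sketch of what turning that on could look like, assuming a `transformers` release where T5 models support `gradient_checkpointing_enable()` (not necessarily the version used in this thread). The helper name `enable_gradient_checkpointing` is hypothetical, and disabling `use_cache` is my assumption about what is needed during training.

```python
def enable_gradient_checkpointing(model):
    """Trade compute for memory: recompute activations in the backward pass."""
    # Available on transformers PreTrainedModel subclasses that support it.
    model.gradient_checkpointing_enable()
    # Assumption: the decoder key/value cache is only useful for generation,
    # so switch it off while training with checkpointing.
    model.config.use_cache = False
    return model
```

With a model loaded via `T5ForConditionalGeneration.from_pretrained(...)`, calling `enable_gradient_checkpointing(model)` before training should leave room for a larger per-GPU batch, at the cost of slower steps.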
