SHARDS:
I think that, in general, the shards need to be around 1-2 GB, as this also helps with merging in Colab. Often we find elements we wish to absorb but it is hard to do, because not everyone pays for the upper Colab tiers. I could probably also train on my own machine (Asus ROG Scar), but downloading the models is the main issue due to their size. I also found that models which did not run on HF could actually run quite fast on my local machine!
Only my thoughts... I'll soon be ready for fine-tuning, so I will keep watch!
Thanks, good work!
I also figured out that when you specify 1B (1 billion params) as the shard size, the shards come out at about 1.9 GB in 16-bit, and double that in 32-bit.
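For anyone who wants to check those numbers, here is a rough sketch of the arithmetic. It assumes "1b" means one billion parameters per shard and that the file size is reported in GiB (2^30 bytes); `shard_size_gib` is just a name I made up for illustration, and it ignores any header or metadata overhead in the file.

```python
def shard_size_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate size in GiB of a shard holding num_params weights.

    Assumption: raw weights only, no file header or metadata overhead.
    """
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 2**30                     # bytes -> GiB


# One billion parameters per shard, at 16-bit and 32-bit precision.
for bits in (16, 32):
    print(f"{bits}-bit: {shard_size_gib(1_000_000_000, bits):.2f} GiB")
# prints:
# 16-bit: 1.86 GiB
# 32-bit: 3.73 GiB
```

So 16-bit works out to roughly 1.9 GB per shard, and 32-bit is exactly twice that, since each parameter takes 4 bytes instead of 2.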
Also, I did notice an improvement after your model was incorporated (a pinch of spice)...