I have finally started training! It looks as though it will take ~9-10 hours based on 11% progress after ~1hr. Not too bad. (NB I did start training from scratch.)
Question: 112k images and batch size = 8… that’s quite a small batch size.
Was it chosen after experimenting with other batch sizes? I wonder whether training would be faster with larger batches (surely yes: PL fires callbacks at the end of every batch, and there is other per-batch overhead). I also wonder how much data is being transferred, perhaps inefficiently, to the GPU per batch. In theory, I think my GPU could keep ALL the data in memory; I’m just not sure how this part works!
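Whether the whole dataset fits on the GPU depends entirely on image resolution and dtype, neither of which is given above. A back-of-envelope sketch, assuming (hypothetically) 224×224 RGB images, suggests the raw uint8 pixels might fit on a large card but a float32 copy would not:

```python
# Back-of-envelope check: can the whole dataset live in GPU memory?
# Assumptions (NOT from the post): 112k images at a hypothetical 224x224 RGB.
n_images = 112_000
h, w, c = 224, 224, 3  # assumed resolution

bytes_uint8 = n_images * h * w * c  # raw uint8 pixels
bytes_fp32 = bytes_uint8 * 4        # after conversion to float32

print(f"uint8:   {bytes_uint8 / 1e9:.1f} GB")  # ~16.9 GB
print(f"float32: {bytes_fp32 / 1e9:.1f} GB")   # ~67.4 GB
```

So "keep it all on the GPU" is plausible only if the images are stored as uint8 (or are smaller than assumed here) and converted to float per batch.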
Info for @robertl: it took ~3 minutes for the training panels to become “active”, during which time there is no indication of what is happening or how long it will be before training starts. Is that something that could be added?
And, recalling a Slack discussion about training slow-down, there seem to be a number of things that could be done to speed things up:
- The info display is absolutely great for model development; it’s a major strength.
- However, once the model is “developed” and ready for a full training run you could:
  - Still acquire data, but turn off live updates to the screen until e.g. a refresh button is clicked (or update every nth batch, or…)
  - Not acquire data at all (if that makes it even faster) and just show a message ("Disabled for …")
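The first option (keep collecting metrics, redraw rarely) could be sketched framework-agnostically like this; the class and method names are hypothetical, not an existing PL API:

```python
class ThrottledUpdates:
    """Sketch of the 'every nth batch' idea: metrics are still acquired
    every batch, but the (expensive) display refresh happens rarely.
    All names here are hypothetical illustrations."""

    def __init__(self, every_n: int = 100):
        self.every_n = every_n

    def should_redraw(self, batch_idx: int) -> bool:
        """Return True only when the screen should actually be refreshed."""
        return batch_idx % self.every_n == 0

throttle = ThrottledUpdates(every_n=100)
redraws = sum(throttle.should_redraw(i) for i in range(1000))
print(redraws)  # 10 redraws instead of 1000
```

In PL itself this would presumably live in a `Callback`'s batch-end hook, which is exactly the per-batch overhead mentioned above.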
Could PL give a simple ETA for training completion? One can roughly work it out in one’s head from the progress percentage, but it would be nice to have a date/time for completion.
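The mental arithmetic is just linear extrapolation; a minimal sketch (the function name and the start time are made up for illustration, using the ~1 hour / 11% figures from the top of this post):

```python
from datetime import datetime, timedelta

def eta(started_at: datetime, now: datetime, fraction_done: float) -> datetime:
    """Naive linear ETA: total time = elapsed / fraction_done."""
    elapsed = now - started_at
    return started_at + elapsed / fraction_done

# Hypothetical start time; 11% done after ~1 hour, as reported above.
start = datetime(2023, 1, 1, 9, 0)
now = start + timedelta(hours=1)
print(eta(start, now, 0.11))  # finishes a little after 18:05 the same day
```

A real implementation would want a smoothed (e.g. moving-average) batch rate rather than a single global fraction, since early batches include warm-up overhead.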
What’s the internal thinking on performance these days?