Feature request: multiple instances

It would be very helpful to be able to train a model and design/edit a model at the same time.

Most of the PL examples to date use relatively small data sets or limited numbers of epochs. But when the data set is large, more epochs or greater accuracy are needed, or hyperparameter tuning is going on, an instance could be completely tied up for hours or days (*).

For serious use it is becoming necessary (and would be very nice) to be able to run multiple instances, so that at the very least one could model and train at the same time.

Who knows, maybe TF2 and CUDA are clever enough that I could even train several models at once on a single GPU (it might not be the most efficient approach though!). At the very least I ought to be able to train 2 (1x CPU, 1x GPU) and edit at the same time…
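For what it's worth, plain TF2 does let a process claim GPU memory incrementally instead of grabbing it all, so two runs could in principle share one card, and a run can be pinned to the CPU explicitly. A minimal sketch of what I mean (my own assumption about the setup, nothing PL-specific):

```python
import tensorflow as tf

# Let this process allocate GPU memory as needed rather than all at once,
# so a second training process could share the same GPU (TF 2.x API).
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# A second run could instead be pinned to the CPU explicitly:
with tf.device("/CPU:0"):
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(...) would now execute on the CPU
```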

(*) speaking of which - does checkpoint data allow interrupted training to be resumed? So that, if I started a training run that would take 7 days and after 6.5 days Windows crashed, could I restart and only have to train from the last checkpoint?

Checkpoint interval… something else for user prefs :wink:

Hey @JulianSMoore,
I have some good news for you: this feature is something we have in the pipeline to develop for cloud, and development on it should start in just a month or so.
When it's done, we will spin up a new instance each time you press the Run button or run a test :slight_smile:

> speaking of which - does checkpoint data allow interrupted training to be resumed? So that, if I started a training run that would take 7 days and after 6.5 days Windows crashed, could I restart and only have to train from the last checkpoint?

Yup they do.
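For reference, here is roughly what checkpoint-based resuming looks like in plain TF2/Keras; the path, model, and data below are illustrative placeholders, not PerceptiLabs internals:

```python
import os
import numpy as np
import tensorflow as tf

ckpt_path = "checkpoints/run1.weights.h5"   # illustrative path, not a PL default
os.makedirs("checkpoints", exist_ok=True)

# Toy data and model purely for illustration.
x_train = np.random.rand(256, 20).astype("float32")
y_train = np.random.randint(0, 10, size=(256,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# If a previous run was interrupted, resume from the last saved weights.
if os.path.exists(ckpt_path):
    model.load_weights(ckpt_path)

# save_freq controls how often a checkpoint is written (here: once per epoch).
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(ckpt_path,
                                             save_weights_only=True,
                                             save_freq="epoch")

model.fit(x_train, y_train, epochs=5, callbacks=[ckpt_cb])
```

The `save_freq` argument there is essentially the "checkpoint interval" you mention.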

Good news on both fronts! Thanks :slight_smile: