Disconnection between the web end and the trainer

This will be a little hard to describe, so bear with me.
When I do a long training session (3-4 days), there seems to be a disconnection between the trainer and the web end, where the web end stops receiving updates from the trainer: status, percentage, loss, accuracy, all of it. This has happened twice and seems to only happen when it's left to idle after it auto logs you out.

I left it to continue training without receiving updates until I saw the terminal window log "INFO:perceptilabs.applogger:Finished epoch 10/10" and then restarted PerceptiLabs, but unfortunately it still thinks the model is being trained, showing a validation at 86%, which it isn't, so I suppose it failed to save after the epoch.

Not exactly sure what’s going on, and it’s a little tricky to log long runs like this. The only things I could catch while they were still in the buffer were a “BadRequest: /mixpanel/track/” and a couple of “POST” entries with 400 23.

Any ideas?

Hi @RBR

Great to be able to greet another new user like me - the more the merrier!

I can’t help with your specific issues (but I’m sure @robertl will); I just wanted to share a couple of pieces of info:

  • interruption of training by the auto-logout: there’s a feature request here on Canny to improve that aspect of operations, and it is currently marked as Planned, so hopefully there will soon be fewer interruptions

  • long logs are a pain, but have you tried running PL in verbose mode, i.e. perceptilabs -v=3, to catch even more data? I think the PL guys would be quite happy with an attached file of any length (I usually copy and paste, but I suppose you could pipe the output directly to a file - see the sketch below)
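
A minimal sketch of what I mean by piping to a file (the perceptilabs -v=3 flag is the one mentioned above; the log filename is just an example, and I haven't tried this on a multi-day run myself):

    # Run the PL kernel in verbose mode, keep the output visible in the terminal,
    # and save a copy of everything (stdout and stderr) to pl_run.log for attaching later.
    perceptilabs -v=3 2>&1 | tee pl_run.log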

Finally - 3-4 days of training! That’s a lot of something going on. I’m always interested in how people are applying ML, their architecture choices, etc., so any time you feel like sharing, you have a guaranteed audience here - and who knows, maybe others are doing something related.

Hope you get it sorted soon…

Hi @RBR,
Welcome to the forum!

First of all, 3-4 days - sounds like you are doing something really cool!
As @JulianSMoore said, if you want to share then we would love to hear.

As for the issue, sorry that you are having trouble with longer training runs.
The feature that @JulianSMoore pointed out will help with that, and we are trying to squeeze it into the release that’s targeted for the upcoming Wednesday (there’s a ton of new stuff coming out then).
There’s another fix incoming as well that lets you fetch data after the training has finished (fixing another bug in the current release version), which should also help with this issue.
Between those two fixes, this issue should no longer exist :slight_smile:

All the best,
Robert

The project isn’t anything impressive, just NSFW image classification. I’m basically reinventing the wheel so I can expand it to other subjects that are in poor taste and have the ability to maintain it when additions and changes need to be made. It will be used for sifting through user-submitted content for moderation, but right now I’m just trying to get “something” off the ground, and so far I haven’t been able to.

I don’t feel that getting rid of the auto logout would solve this problem, just move it somewhere else, for example to a browser crash or a user wanting to close their browser. I think the bigger issue is that it’s so far unrecoverable, leaving the model project in an unusable state unless you stop the run (which isn’t actually running) and retrain over it. Also, I don’t believe duration is a factor here, since the last run did it at 17% and the other at 86%; it just has to run long enough for the auto logout to happen and then sit idle for a while, since it’s not always repeatable.

Edit: I was able to run a test but was unable to export the model; it shows errors similar to the ones I had previously every time I hit the button…

Hi again @RBR,
Sounds like a great project!

About the issue: from what I read in your first post, it looks like the training completes on the kernel side (it went to 10/10 epochs).
However, there currently is a bug where the frontend does not fetch any values after the training finishes, so even though you have a fully trained model in the kernel you can’t see it in the statistics view.
The second bug fix I mentioned in my first message will make sure that you can always fetch the latest statistics, even if you close and re-open the browser in between :slight_smile:

It’s strange that you can’t export it though. Does it show anything more than the screenshot you sent, or anything in the frontend? Those 400s you are showing have no impact on usability.

Unfortunately no, at least not when the button is pressed. There may be something higher up in the log, but I would expect an error to pop up whenever a button action is done. I will see if I can properly capture the log from startup to export and message it later.

Edit: I’d like to mention that I tried the current pl-nightly build and it exhibits the same behavior with the model, although I have yet to train a new one on it to see if the disconnect symptom can still happen.
