App reports kernel offline. Can't get to statistics view after auto-logout

To my delight, using images as targets now works :slight_smile: However, I now encounter two other problems:

  1. The app always reports the kernel as offline (even though it obviously isn’t, because I can still train models?). There is no output in the console.

  2. After being logged out automatically, I cannot get into the statistics view to see progress, even though the model is still training. I have to click stop, wait until it stops (this takes a while), and then resume training.

Hey @birdstream,
Haha I’m glad you like the target images fix :slight_smile:

For the other issues:

  1. You mean you constantly get the popup? Or that the little circle in the bottom right is constantly red?
    If it’s the little circle then that’s just a visual bug, but if it’s the constant popup then something strange is going on because we (supposedly) completely removed it a few releases ago.
  2. You can’t get to the statistics even after refreshing? Adding this as a prio bug regardless

Hi Robert!

The circle is constantly red (reads “offline” on hover)

I’m currently working with a relatively large dataset, so maybe it wants to finish the current epoch or whatever is in the queue first? We’re talking about ~1 hour per epoch… Returning to the session and model view, it does read “Training…” by the progress meter, but the statistics button is greyed out. Also, stopping the training seems to work visually, but it’s evidently still training in the background, as I can see GPU utilization still sitting at about 100%. I can’t start training again either, it just hangs…

Closing and reopening the browser has the same effect (as being automatically logged out, that is), and I really can’t do much more than Ctrl+C and start PerceptiLabs again.

The circle is constantly red (reads “offline” on hover)

You can ignore this one for now, it’s not indicative of the actual status of the Kernel at the moment :slight_smile:

For stopping training, there will always be a bit of a delay because it needs to finish whatever batch or process it’s on, but after that it should stop.
I’ve placed a ticket to look into it more, though, to see if there is something that doesn’t quite stop as it should.

I’ve placed a ticket to look into the issue where you can’t get the statistics view to reappear as well, or at the very least to make sure that you don’t get automatically logged out.

Will keep you updated as things roll in, and thanks for reporting the issues! :slight_smile:

Edit: Quick update, I just tested starting training and then tried:

  • refreshing the frontend
  • manually logging out and in again
  • restarting the browser

and I could see the statistics view again afterwards without issue.

It could be that it’s because of the long time between each update; would you mind trying the same thing on a smaller dataset and seeing if it happens there?
For example, the mnist_data that comes with the tool; just set epochs to something large so you have time to try it out.

By the way @birdstream, it looks like you are working on a really cool use case!
Would you mind / are you able to talk a bit more about what that use case is, specifically?
Info like this helps us a lot with knowing how to prioritize different features etc.


I did as you suggested and reduced the dataset to 10,000 images, and voila: it now works as expected :sweat_smile: I am now able to close the browser and get back to the statistics view etc. Still, running into OOM situations (I would assume) and the like seems to just crash everything, and I need to force-close and start PerceptiLabs again. The frontend works, but starting training etc. doesn’t. I do realize I could use a bit more powerful rig to do ML, though, because I guess I threw more at it than it could handle :see_no_evil:

What am I doing? Well, I just find ML and neural networks extremely cool and like to learn about them. I started out doing some ML in PyTorch (colorization of b/w photos), and I did have some success with it, but since I suck at coding I wished there was some neat graphical frontend that would abstract away all the boring coding stuff. And guess what, I found it! :wink: Right now I’m just experimenting with 2x super resolution using a basic U-Net-like model.

If I could wish for something now, it would be bringing GANs back. Also, it would be great to throw in some learning rate schedulers.
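To illustrate what I mean by schedulers, plain Keras supports something like the sketch below (just an illustration; the halve-every-10-epochs schedule and the tiny stand-in model are made up):

import numpy as np
import tensorflow as tf

# Hypothetical schedule: halve the learning rate every 10 epochs.
def step_decay(epoch, lr):
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

# Tiny stand-in model and data just so the sketch runs end to end.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

x, y = np.random.rand(64, 4), np.random.rand(64, 1)
model.fit(x, y, epochs=30, verbose=0,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)])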

Hi @birdstream, I thought you might be doing some super-resolution stuff… that’s cool.

I look forward to seeing what else you come up with in your experiments (I’m trying a big regression on real astronomical data… the technical issues with the type of model in PL are slowly being dealt with and I’ll share my PL model when it works)

Hope the forum is good for you & if you have any suggestions (PL features or the forum) the PL guys are very receptive :slight_smile:


@birdstream great that it works!
Haha, we are getting a good number of GAN requests; it’s already high on our prio list, but maybe we can put it even higher.

One quick tip: try using the UNet component instead of a self-built one. Fewer components = fewer visualizations = a chance it won’t use as many resources and crash on you.
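If you ever experiment with TensorFlow directly outside the tool, one general setting that can soften GPU OOM crashes is memory growth; a minimal sketch, assuming TF 2.x (this is plain TensorFlow, not a PerceptiLabs setting):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing it all
# up front; this sometimes turns a hard OOM crash into a slower-but-alive run.
# Must be called before the GPU is first used.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

print(f"{len(gpus)} GPU(s) configured for memory growth")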

And agreed with @JulianSMoore, looking forward to seeing any experiments you come up with :slight_smile:


I’ll come up with more stuff, no doubt. I love playing around with this.
Yes, I have also tried U-Net with some different backends and that works great :slight_smile:

However, now I experience quite a noticeable slowdown in training after 15+ epochs (batch size of 8 on 10,000 images). I can see in the NVIDIA tool that GPU utilization only comes in small bursts, and it also seems to munch more and more RAM as training goes on. Everything becomes sluggish after a while, and stopping training takes “forever”. What could be going on?


Hmm, we’ll take a look from our side to see what it might be :slight_smile:


Now that I’ve looked a little more at it, it seems like it gets slower with each mini-batch rather than with each epoch.
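For reference, outside PerceptiLabs that kind of per-batch creep can be measured with a small Keras callback like the sketch below (toy model and a made-up logging interval, just enough to show the idea; assumes TF 2.x):

import time
import numpy as np
import tensorflow as tf

# Log the wall-clock time of every Nth training batch, to see whether the
# per-batch time creeps up as training goes on.
class BatchTimer(tf.keras.callbacks.Callback):
    def __init__(self, log_every=50):
        super().__init__()
        self.log_every = log_every
        self._t0 = None

    def on_train_batch_begin(self, batch, logs=None):
        self._t0 = time.perf_counter()

    def on_train_batch_end(self, batch, logs=None):
        if batch % self.log_every == 0:
            dt = time.perf_counter() - self._t0
            print(f"batch {batch}: {dt * 1000:.1f} ms")

# Toy stand-in model and data so the sketch is runnable end to end.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(1024, 8), np.random.rand(1024, 1)
model.fit(x, y, epochs=3, batch_size=8, verbose=0, callbacks=[BatchTimer()])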

That’s great to know, thanks!
I’ve added it as a bug on our side, but it’s a bit of a larger one so it may take some time before I have any news on it.
Apologies for the inconvenience!


It’s cool, I really appreciate the work! After digging around the net, it seems to me that this may be a TensorFlow/Keras issue? A lot of other people have had problems with it running slower and slower as training progresses.

One other thing to add: it seems that if I do wait until everything’s finished in the background, close down PerceptiLabs and restart it, resuming training is still really slow.

Hi @birdstream

If you still have any links to share on the training slowdown I’d love to see them, because I suspect this issue only affects certain kinds of models and it would be good to be aware of the current situation generally. I’ve not done any CNN models for a while and I don’t have any slowdown issues*, and I’m guessing this might be cuDNN related.

(* In fact, I have been most surprised that the TF/GPU (pure Python) combination is remarkably consistent in training speed across ~1-2 orders of magnitude in the number of trainable params: from 3k to 70k, batch time was almost constant, and did not degrade over the course of a couple of hours, using only Dense layers (with an input normalizer).)

I suppose there can be more than one reason it slows down, but one thing that gets mentioned a lot is that certain operations get added to the graph over and over, like in this post I found on Reddit: https://www.reddit.com/r/tensorflow/comments/b9bnvx/why_is_this_code_slowing_down_so_much/?utm_medium=android_app&utm_source=share

But like I said earlier, my programming skills are “meh”, so I really couldn’t tell.
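If I understand those threads correctly, the pattern they warn about looks roughly like this in TF 2.x (a sketch of the general pitfall, not the tool’s actual training code):

import tensorflow as tf

@tf.function
def train_step(x):
    # This Python print only runs when TensorFlow (re)builds the graph,
    # which makes retracing visible.
    print("tracing train_step")
    return x * 2.0

# Slow pattern: every new Python value forces a retrace, so the graph keeps
# growing and each call gets more expensive.
for i in range(5):
    train_step(float(i))

# Reused pattern: tensors of a fixed dtype/shape are traced once and then the
# same graph is reused on every call.
for i in range(5):
    train_step(tf.constant(float(i)))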

My model in this case is a bunch of convolutional layers with a dense layer in the middle:

Layer (type)                 Output Shape              Param #   
=================================================================
low (InputLayer)             [(None, 56, 56, 3)]       0         
_________________________________________________________________
training_model (TrainingMode ({'high': (None, 224, 224 3672749   
=================================================================
Total params: 3,672,749
Trainable params: 3,672,749
Non-trainable params: 0

Hi @birdstream - thanks for the extra input. My guess that cuDNN/convolutions might be at the heart of the issue is upgraded to belief :wink: I followed your link and googled for cuda cudnn slowdown, which led to another possibility worth checking (no programming required ;)…

Are you training on a laptop? If so, check CPU & GPU temperatures (e.g. with hwInfo) - it could be thermal throttling.

If training on a desktop: how is memory usage changing over time? (another reddit thread)
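A quick way to watch that over time is a little monitor script along these lines (a rough sketch; it assumes psutil is installed and nvidia-smi is on the PATH, and the one-minute interval is arbitrary):

import subprocess
import time

import psutil  # third-party: pip install psutil

def gpu_mem_mib():
    # Ask nvidia-smi for the used GPU memory in MiB (first GPU only).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().split()[0])

# Print system RAM and GPU memory once a minute; steady growth during
# training then shows up as a climbing trend in the log.
while True:
    ram_gib = psutil.virtual_memory().used / 2**30
    print(f"RAM used: {ram_gib:.1f} GiB | GPU used: {gpu_mem_mib()} MiB")
    time.sleep(60)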

I’m training on a desktop. RAM usage is about 2.5 GB in the beginning and ends up being around 3-3.5 GB after a while, especially if I stop/run the model during this time… maybe it’s related to the slowdown, but I don’t think that’s the reason. It most certainly isn’t throttling, anyway. From what I can see in GreenWithEnvy (an NVIDIA GPU tool) during training, the GPU gets utilized in bursts, and the longer the training runs, the longer the time between these bursts gets. That’s the best way I can describe it :sweat_smile:

@birdstream - that’s great input… I don’t know what it means (apart from not thermal throttling, obviously) but it’s really good to have that sort of info (well, I could guess a bit, but the point is that the PL guys would guess better!). If you don’t mind sharing your system spec, that could be handy to know too…

I will provide information as exhaustive as possible :wink:

H/W path       Device      Class          Description
=====================================================
                           system         All Series (All)
/0                         bus            Z97-P
/0/0                       memory         64KiB BIOS
/0/42                      memory         16GiB System Memory
/0/42/0                    memory         4GiB DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
/0/42/1                    memory         4GiB DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
/0/42/2                    memory         4GiB DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
/0/42/3                    memory         4GiB DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
/0/51                      processor      Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
/0/51/52                   memory         256KiB L1 cache
/0/51/53                   memory         1MiB L2 cache
/0/51/54                   memory         8MiB L3 cache
/0/100                     bridge         4th Gen Core Processor DRAM Controller
/0/100/1                   bridge         Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller
/0/100/1/0                 display        GM204 [GeForce GTX 980]
/0/100/1/0.1               multimedia     GM204 High Definition Audio Controller
/0/100/2                   display        Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller
/0/100/3                   multimedia     Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller
/0/100/14                  bus            9 Series Chipset Family USB xHCI Controller
/0/100/14/0    usb3        bus            xHCI Host Controller
/0/100/14/0/d              input          Sensei Raw Gaming Mouse
/0/100/14/0/e              input          Gaming Keyboard G105
/0/100/14/1    usb4        bus            xHCI Host Controller
/0/100/16                  communication  9 Series Chipset Family ME Interface #1
/0/100/1a                  bus            9 Series Chipset Family USB EHCI Controller #2
/0/100/1a/1    usb1        bus            EHCI Host Controller
/0/100/1a/1/1              bus            USB hub
/0/100/1b                  multimedia     9 Series Chipset Family HD Audio Controller
/0/100/1c                  bridge         9 Series Chipset Family PCI Express Root Port 1
/0/100/1c.2                bridge         9 Series Chipset Family PCI Express Root Port 3
/0/100/1c.2/0  enp3s0      network        RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
/0/100/1c.3                bridge         82801 PCI Bridge
/0/100/1c.3/0              bridge         ASM1083/1085 PCIe to PCI Bridge
/0/100/1d                  bus            9 Series Chipset Family USB EHCI Controller #1
/0/100/1d/1    usb2        bus            EHCI Host Controller
/0/100/1d/1/1              bus            USB hub
/0/100/1f                  bridge         Z97 Chipset LPC Controller
/0/100/1f.2                storage        9 Series Chipset Family SATA Controller [AHCI Mode]
/0/100/1f.3                bus            9 Series Chipset Family SMBus Controller
/0/1           scsi0       storage        
/0/1/0.0.0     /dev/sda    disk           240GB Corsair Force 3
/0/1/0.0.0/1   /dev/sda1   volume         511MiB Windows FAT volume
/0/1/0.0.0/2   /dev/sda2   volume         223GiB EXT4 volume
/0/2           scsi1       storage        
/0/2/0.0.0     /dev/cdrom  disk           CDDVDW SH-224BB
/0/3           scsi2       storage        
/0/3/0.0.0     /dev/sdb    disk           2TB ST2000DM001-1CH1
/0/3/0.0.0/1   /dev/sdb1   volume         1468GiB EXT4 volume
/0/3/0.0.0/2   /dev/sdb2   volume         394GiB EXT4 volume
/0/4           scsi3       storage        
/0/4/0.0.0     /dev/sdc    volume         1863GiB ST2000DM001-1CH1
/1                         power          To Be Filled By O.E.M.

OS information:

Linux joakim-All-Series 5.12.0-19.3-liquorix-amd64 #1 ZEN SMP PREEMPT liquorix 5.12-29ubuntu1~bionic (2021-07-25) x86_64 x86_64 x86_64 GNU/Linux

Nvidia…

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 29%   35C    P8    16W / 180W |    249MiB /  4042MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1273      G   /usr/lib/xorg/Xorg                 15MiB |
|    0   N/A  N/A      1483      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A      4653      G   /usr/lib/xorg/Xorg                107MiB |
|    0   N/A  N/A      4785      G   /usr/bin/gnome-shell               19MiB |
|    0   N/A  N/A     26189      G   ...AAAAAAAAA= --shared-files       31MiB |
+-----------------------------------------------------------------------------+

From the TensorFlow site, the currently supported CUDA version seems to be 11.2 (with cuDNN 8.1) for TensorFlow 2.5.

I notice you have CUDA toolkit 11.4.

Are you able to test with the recommended CUDA toolkit and cuDNN versions? (I know from experience that whether it runs, and runs well, can be quite sensitive to the CUDA/cuDNN versions…)
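If it helps, the CUDA/cuDNN versions a given TensorFlow build expects can also be checked from Python (assuming a GPU build of TF 2.3 or newer, where get_build_info exposes these keys):

import tensorflow as tf

# Print the CUDA/cuDNN versions this TensorFlow build was compiled against,
# to compare with what nvidia-smi and the installed toolkit report.
build = tf.sysconfig.get_build_info()
print("TF version:     ", tf.__version__)
print("built for CUDA: ", build.get("cuda_version"))
print("built for cuDNN:", build.get("cudnn_version"))
print("GPUs visible:   ", tf.config.list_physical_devices("GPU"))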