How to interpret this loss behaviour

One of a series of regressions…

Training on a few hundred rows this time - x & noisy-y

Uniform-X, Nyqist x 100.zip (28.0 KB)

The (quadratic) loss over all epochs looks very… constant. Should I believe this? And how can the validation loss be less than the training loss? If dropout were enabled that would be possible in theory, but both dropout and batch normalisation are off on all dense layers.

NB I did mis-specify the activation functions, but changing them ALL to ReLU made no difference.

For the curious, the function looks like this: red dots are the true values uniformly sampled on the domain, green crosses are the true values at the sampled x values, and blue dots are the y values with added noise. The objective is to assess how well the model predicts between sample points as training progresses. The function is just arbitrarily wiggly, but with Neumann boundary conditions (prescribed derivative values at each end) and a maximum component frequency, so that there is a well-defined Nyquist sampling frequency.

(I have other versions in which x is also random)
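In code, the recipe is roughly the following. This is a minimal NumPy sketch, not the attached dataset: the constants (max frequency, amplitudes, noise level) are arbitrary assumptions, and the boundary-condition fitting is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Band-limited "wiggly" target: a finite sum of sinusoids with a hard
# maximum component frequency, so the Nyquist rate is well defined.
F_MAX = 5.0                       # assumed highest component frequency
N_COMPONENTS = 8
freqs = rng.uniform(0.5, F_MAX, N_COMPONENTS)
phases = rng.uniform(0.0, 2.0 * np.pi, N_COMPONENTS)
amps = rng.uniform(0.2, 1.0, N_COMPONENTS)

def f(x):
    """True (noise-free) function value at x."""
    return sum(a * np.sin(2.0 * np.pi * fr * x + p)
               for a, fr, p in zip(amps, freqs, phases))

# Sample uniformly at many times the Nyquist rate (2 * F_MAX) over a
# unit domain, then add observation noise to get the training targets.
OVERSAMPLE = 30                   # gives a few hundred rows
n = int(2 * F_MAX * OVERSAMPLE)
x = np.linspace(0.0, 1.0, n)
y_true = f(x)                                      # the green crosses
y_noisy = y_true + rng.normal(0.0, 0.1, size=n)    # the blue dots
```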

Hey @JulianSMoore,

This looks like a model that’s not learning anything (how do the gradients look?) and is therefore just repeating the same inference pattern, with no significant updates happening to the model.
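For reference, one generic way to inspect the gradients outside any particular tool is a tf.keras sketch like this (`model`, `x_batch` and `y_batch` are placeholders, not anything from your project):

```python
import tensorflow as tf

# Assumes `model` is a tf.keras model and (x_batch, y_batch) is one
# training batch; all three are placeholders in this sketch.
loss_fn = tf.keras.losses.MeanSquaredError()

with tf.GradientTape() as tape:
    preds = model(x_batch, training=True)
    loss = loss_fn(y_batch, preds)

grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    # Near-zero norms on every layer suggest no learning is happening.
    print(var.name, float(tf.norm(grad)))
```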

The validation loss can be less than the training loss without it being strange; it just means that the model is predicting better on the few validation samples than on the training samples.

What happens if you increase or decrease the model size a bit? Or try with another learning rate? :slight_smile:
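(In generic Keras terms those two experiments look roughly like the sketch below; the widths and learning rates are arbitrary starting points, not recommendations.)

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(1,)),  # try 16/32/128
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # try 1e-2 .. 1e-4
    loss="mse",
)
```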

Hmmm.

I agree that I have yet to match the complexity of the problem to the complexity of the model, but look at the loss within one epoch - neither training nor validation loss ends up at the value it started at, so why is the all-epochs line flat?

I’ll do some more digging later :wink: but IIRC the gradients changed a lot very early on and then flattened out, yet the all-epochs loss is completely flat, and I’m having a hard time understanding that. I will be digging deeper though - watch this space :wink:

look at the loss within one epoch - neither training nor validation loss ends up at the value it started at, so why is the all-epochs line flat?

Ah, that’s because these are just per-iteration losses, meaning that the values (batches/samples) that go through the model at the end are different from the ones at the start.
If we assume that the model is not learning and the loss reflects the quadratic difference between prediction and target, then surely it wouldn’t be that strange that two different batches, with different samples in them, get different results? :slight_smile:
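To make that concrete, here’s a toy illustration with a frozen “model” whose weights never change: the per-batch MSE still varies, purely because each batch contains different samples.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 256)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.1, 256)   # noisy targets

def predict(x):
    return np.zeros_like(x)   # a "model" that never learns

for i, idx in enumerate(np.array_split(rng.permutation(256), 8)):
    batch_loss = np.mean((y[idx] - predict(x[idx])) ** 2)
    print(f"batch {i}: MSE = {batch_loss:.4f}")
# The losses differ from batch to batch even though the model is
# identical throughout.
```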

but IIRC the gradients changed a lot very early on and then flattened out, yet the all-epochs loss is completely flat, and I’m having a hard time understanding that

If it did learn at the start of the first epoch and then flattened out, then it is a bit strange, since you are saying that the epoch loss is the same in the first epoch as it is in all the others :thinking:
I need to double-check exactly what the epoch loss is - whether it’s a moving average or just a plain average (I was pretty sure it’s just a plain average).

Oh, my mistake; I had been expecting some view that showed the NET loss at e.g. batch end. If those are the losses per individual sample, then it’s much harder to draw conclusions.

Will collect more data! (Though there might be another post from me that shows the gradients… the one where the max/av swap was noted?)

Ah, I misspoke a bit: it’s per iteration, which means per batch (not per sample) :slight_smile:
If you are looking for the net loss, then you want to look at how the last points in the “per-epoch” graph change.

LMK if you think some other representation would be simpler!

I don’t know. But it does raise the question: what exactly is the “epoch loss” - the sum, mean, median… of the batch losses?

The epoch loss is the average of the batch losses :slight_smile:
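In other words, assuming a plain unweighted mean over the iterations:

```python
# Plain average of the per-iteration (per-batch) losses in one epoch.
batch_losses = [0.42, 0.38, 0.41, 0.35]   # example values
epoch_loss = sum(batch_losses) / len(batch_losses)
print(epoch_loss)   # 0.39
```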


Issue solved - and when I work out how to turn it into “advice” I will… UPDATE: when you scan the input data, just note whether any negative numeric values are present (especially in the target), and nudge the user if they pick e.g. ReLU that the model might not be able to fit at all, given ReLU’s complete suppression of negative values.

I had all activations set to ReLU - but my graph goes -ve! I think the model quickly converged on the best possible solution under that unreasonable constraint :wink:
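A minimal sketch of the failure mode (generic tf.keras, not my actual model; the layer sizes, epoch count, and sine target are arbitrary stand-ins):

```python
import numpy as np
import tensorflow as tf

x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = np.sin(3.0 * x)                    # targets go negative

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(32, activation="relu"),
    # The culprit: ReLU on the output layer clamps every prediction
    # to >= 0, so the negative parts of the target are unreachable.
    tf.keras.layers.Dense(1, activation="relu"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=50, verbose=0)

print(model.predict(x, verbose=0).min())   # never drops below 0.0
```

Swapping in leaky ReLU (or simply leaving the output layer linear) lifts that constraint.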

Now I have leaky ReLU, and the loss curves and gradients look much more reasonable! (Though the scale of the loss curve would benefit from a zoom.)
