Increasing the number of images increases the loss

This is a regression problem and I was surprised to see this. When it happened before I increased the batch size and that helped a lot, but with a small increase in memory usage. However, I increased the number of images again and the loss increased again.

Is this what I should expect? Are there any other tricks to try when this happens as I increase the number of images?

I have 2000 images with a batch size of 64. I am wondering if there is a semi-optimal image-to-batch-size ratio that we ought to aim for?

@robertl and I have been discussing the phenomenon of Double Descent, and you have just encountered one aspect of the way that neural network learning is affected by the number of epochs, the number of parameters in the model, or the amount of data. I can give you a bibliography of papers on the topic if you are interested (I skim for accessible stuff, but I have the full papers in Zotero).

TL;DR Yes, this is exactly what I would expect. Simple heuristic: more data = more adjustments to the weights required to find the best match to ALL of them. Therefore more epochs, or some other adjustments, are required.

As to optimal batch size, learning rates, etc. Welcome to the Threshold… beyond here it is either deep research, “hyperparameter tuning” (ML aided guesswork) or “skill” to find the right set of things…

The “ratio” you are seeking - if it exists at all - is probably a function of many other model parameters (#params, activations, breadth/depth ratio, etc.) so, unless someone else knows otherwise (in which case I am very interested to learn more), I think you will end up learning your own rules of thumb - which may or may not be transferable.

(Any time you can share a description of your model and some sample inputs etc. would help put all this into a specific context. For example, you say images and regression… logistic regression?)

PS This is one of the major reasons I like the interactivity and graphs of PL… one can see the behaviour very quickly and easily. (Yes, there is TensorBoard but that has always been more challenging for me :wink: )

Thanks Julian,

At least now I have a name for it. I hadn’t heard of Double Descent before, so now I have somewhere to start, and yes, I would love to see your papers.

I am trying to model the FX market (yes, not as altruistic as teaching students) and it was always going to be tough, but I guess it starts me on my ML learning journey.

This is teaching me about the PL interface and what all the charts mean. I have a question, is the loss the sum of (1/N) * (pred - actual)^2 or the sum of abs(pred - actual)? The reason I ask is that the first one allows me to calculate a standard deviation.

I don’t know what type of regression this is. I select image as the input and a numerical output in the data wizard. I don’t know how else to select the regression type. I assumed that PL was doing that for me. Sorry for my incompetence, I guess making the interface user friendly means that people with lower levels of skill will start playing with it.

Here is an example of the type of image that I am trying to analyse.

Thanks again for all your help and patience.

Reply Pt1

Here are the references - everything either tagged with “Double Descent” or containing that phrase. The Belkin papers are highly cited.

FX eh? I wrote a Chaos detector for that as an adjunct to time-series forecasting (so the nn could learn not to predict the unpredictable :wink: ) but never actually got around to linking the chaos detector to a nn.

And the detector is in Mathematica code… I have been toying with the idea of porting it to python IF I can refactor it in a certain way.

But anything that motivates ML learning is, er, a good motivation for learning ML

Re losses: I have seen a plot in PL of mean absolute error, but I haven’t seen one for mean squared error (or RMSE)… however, for regression with Quadratic loss, the loss involves the sum of the squares of the differences. Since N doesn’t change during training, the 1/N could be left out for efficiency, but it’s only one op so it might be in…

Here you will find all the Keras losses, but which of the mean-squared variants PL uses, I don’t know.
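For concreteness, here are the two formulas James mentioned, in plain numpy (just an illustration of the arithmetic, not necessarily what PL computes internally):

```python
import numpy as np

pred = np.array([1.2, 0.8, 2.5])    # made-up predictions
actual = np.array([1.0, 1.0, 2.0])  # made-up targets

mse = np.mean((pred - actual) ** 2)   # mean squared error: (1/N) * sum of squared differences
mae = np.mean(np.abs(pred - actual))  # mean absolute error: (1/N) * sum of absolute differences

print(mse, mae)
```

The Keras classes are `MeanSquaredError` and `MeanAbsoluteError`; only the squared version gives you something you can turn into a standard deviation, as you noted.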

Hi @JWalker

TL;DR Smaller batches are usually better with SGD

Just came across this paper:

  • Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. ‘On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima’. arXiv:1609.04836 [cs, math], 9 February 2017. http://arxiv.org/abs/1609.04836.

which has in the abstract:

It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization.

and

small-batch methods consistently converge to flat minimizers

Which is, I think, key to successful generalisation (I speculate that adding weight decay is also good for producing flat minimizers - if anyone knows of any literature on this please let me know)

I don’t know what the situation is for other optimisers (Adam, AdaGrad, SGD with momentum etc.) but it seemed like a good idea to record somewhere around here that for SGD at least, smaller is better

Thanks Julian, this is very interesting. Unfortunately, I am now having trouble encouraging PL to work. I will be chatting with Robert in the other thread until I can resolve this issue.

Hi @JulianSMoore

Ok, so if I use a small batch size, do I need to increase the number of epochs?

For example, if I have 2000 images and use a batch size of 32 with 1000 epochs, then I am only using my data 1.6 times over (3200 images in total). Am I understanding that correctly? Might it be better to double the number of epochs in that case?

Using smaller batches should help avoid “sharp minimisers” (poor generalisers), but the number of epochs should, I think, be decided by the behaviour of the model rather than up front, unless there are specific reasons to favour some number.

(That said, 1000 epochs feels like quite a lot - at least for initial exploration.)

Regardless of the absolute #epochs intended, I would be looking at the training (dashed line in image below) and validation/test loss behaviours (solid line).

If you see your validation loss going up after the minimum in the “classical regime”, you are entering the overfitted-but-not-generalised regime, and you could consider stopping at the validation minimum, because it will take several times that #epochs (there’s no scale on this figure) to generalise back down and further reduce the validation error (in the “double descent”/generalisation/“modern interpolating” regime).
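If you were driving Keras directly rather than PL (I don’t know whether PL exposes anything like this yet), “stop at the validation minimum” is usually done with an EarlyStopping callback - a rough sketch, with an arbitrary patience value:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop if validation loss hasn't improved for 10 epochs (arbitrary choice)
# and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```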

Caveat: double descent doesn’t always occur - it does depend on regularisation and optimisation methods and I don’t have a good handle on all the possible combinations of known values (that would be real research I think)

But, if you have a lot of data, it could take a long time to see what is happening at all (training and validation losses both still falling at the end of your #epochs), in which case one might want to try a smaller (representative) dataset so that model behaviour can be assessed.

Unfortunately one then needs to balance the “power” of the model to fit your data exactly against the amount of data!

Too much data and the behaviour of your model might not emerge quickly enough, too little and you’ll see a different behaviour!

Since I don’t have any formulae for finding the sweet spot, I would recommend starting small and quick and adding data/epochs according to the behaviour you see.

Perhaps not the definitive answer you were hoping for… I’d be very interested in everyone else’s thoughts on this too.

(As for your specifics and the 1.6x data… I didn’t see/have forgotten where the 1.6 came from: if using the default splits 70:20:10, every epoch uses 0.7x your data for training; 1000 epochs = 700x your data. Can you clarify?)
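Just to make my arithmetic explicit (a trivial sketch):

```python
n_images = 2000
train_frac = 0.7        # default 70:20:10 split
epochs = 1000

train_per_epoch = n_images * train_frac         # 1400 images seen each epoch
total_presentations = train_per_epoch * epochs  # 1,400,000 image presentations overall
times_over = total_presentations / n_images     # = 700x the dataset
print(train_per_epoch, total_presentations, times_over)
```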

Thank you again @JulianSMoore

Sorry, the 1000 was a typo, I meant 100. So 32 (batch size) x 100 epochs, which is the default setting. In that case I believe that I would be using 2240 images for training and the remainder for validation and testing, is that correct?

My thoughts about increasing the number of epochs were that I was going to play about with the dropout ratio and try to increase training loss with only small increases to validation loss. If I can do that, then I hope that increasing the number of epochs would improve both and lead to a better model.

One useful part of the final paper that you linked was in appendix B, where they describe the models that they use. For example,

For this network, we use a 784-dimensional input layer followed by 5 batch-normalized layers of 512 neurons each with ReLU activations.

I had no idea of the complexity of some of these models. That must take some serious computing power!

If you have 2000 images, at the default setting (70:20:10) it will use 1400 for training, 400 for validation and 200 for test per epoch.

Note that PL makes the split early on and then does not mix samples between them, so the 1400 will be used for training 100 times, the 400 for validation 100 times, etc. Randomisation only occurs within splits.
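My mental model of that behaviour, as a sketch (this is not PL’s actual code, and I’m assuming the split is made in file order):

```python
import numpy as np

def fixed_split(x, y, fracs=(0.7, 0.2, 0.1)):
    """Split once, up front, in file order; samples never move between splits afterwards."""
    n_train = int(len(x) * fracs[0])
    n_val = int(len(x) * fracs[1])
    train = slice(0, n_train)
    val = slice(n_train, n_train + n_val)
    test = slice(n_train + n_val, None)
    return (x[train], y[train]), (x[val], y[val]), (x[test], y[test])

# Each epoch then reshuffles the order of the training samples only;
# nothing ever crosses between the three splits.
```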

Yes, increasing dropout is one way of trying to avoid classical overfitting - it (somehow, I know the principle but the specifics elude me) helps prevent memorisation of the training data if the model size is large compared to the dataset size. (I should really get a better handle on Batch Normalisation, which I don’t really understand right now)

Model Size

Within TF there’s a model.summary function that will tell you how many parameters a model has, hence what load it would present - it’s not accessible via PL right now (but see this feature request and upvote if you like it). However, you can calculate it by hand, see e.g. here. Maybe @robertl has other ideas. It’s certainly not a small model, but with a GPU (I have a 1080 Ti) I found very little difference in training speed up to about 75k parameters (I haven’t tried anything really big myself).
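For a dense layer the by-hand count is just (inputs + 1) × units; a toy Keras sketch (layer sizes chosen purely for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),  # (784 + 1) * 512 = 401,920 params
    tf.keras.layers.Dense(1),                                           # (512 + 1) * 1   = 513 params
])
model.summary()  # prints the per-layer and total parameter counts
```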

Dropout Can Cause Validation Loss < Training Loss

One thing about dropout is that (see e.g. here, but this Google query will lead to other reasons, such as biased splits) the loss on training is calculated with the effects of dropout but not during validation/test. Thus if your dropout rate is “high”, validation scores can in fact be lower than the training scores, which goes against naive expectation.

I would therefore consider an initial run to see what the behaviour is before adjusting dropout or anything else (such as reordering your CSV).
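You can see the train/inference difference directly in Keras (a tiny sketch; the 0.5 rate is just for illustration):

```python
import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 4))

print(drop(x, training=True))   # roughly half the values zeroed, survivors scaled up by 1/(1 - 0.5)
print(drop(x, training=False))  # identity: dropout is switched off at validation/test time
```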

A Very Non-Standard Approach

Here’s a technique for effectively NON RANDOM sampling I came up with specifically so I could inspect for generalisation: if it is possible to rank the targets on quality/“difficulty” (I used the std. dev. of observations in the regression inputs… low SD data was better/easier), manually order your data so that the hardest examples are represented in both the training and test datasets, but with fewer “easy” targets in the test data set (watch out if you change the split % afterwards though).

Then, if your model performs well in test you need to worry less about sampling effects (of course, the more data you have, the less sample size can affect things). Normally, better performance on test than on training is a warning sign, but with this approach it can instead give confidence in model generalisation.
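A rough sketch of the ranking step, assuming tabular regression inputs in a numpy array (the difficulty proxy is the one I used; adapt it to your own data):

```python
import numpy as np

def rank_by_difficulty(x):
    """Return sample indices ordered from easiest to hardest.

    Difficulty proxy: the std. dev. of each sample's inputs (low SD = easier in my data).
    Use the ranking to reorder your CSV so that hard examples appear in both the
    training and the test portions of the fixed split, with fewer easy ones in test.
    """
    difficulty = x.std(axis=1)
    return np.argsort(difficulty)
```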

@JWalker and anyone else interested - this is a lovely discussion of batch sizes and the loss landscape - where you’ll find these (and other) images animated showing how the landscape is a sample of “the landscape” - and thus dependent on the batch (size), with greater variation between small batches than between large batches.

(You can’t tell much from these static images, but they look nice)

Thanks again Julian,

I made a mistake before: I thought that only one batch was trained during an epoch. I didn’t realise that all batches are trained during each epoch; that makes more sense now.

Here is a video on batch normalisation. I use deep lizard for a lot of my learning on ML and their video on convolution was very helpful. My only question would be whether or not we ought to have batch normalisation on the final layer before the output?

At the moment, my model is a depthwise convolution (ReLU, batch normalised) -> dense(128) -> dense(1) -> output. There are no activation functions on the dense layers.
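In rough Keras terms I think that’s something like the following (a sketch only - the kernel size and input shape are placeholders rather than my real settings, and I’ve guessed the BN/ReLU ordering and added a Flatten):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.DepthwiseConv2D(kernel_size=3, input_shape=(128, 128, 3)),  # placeholder kernel/shape
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Flatten(),
    layers.Dense(128),   # no activation
    layers.Dense(1),     # regression output
])
```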

I think that it makes sense to add another dense(128) layer and also to add batch normalisation to the dense layers. My logic is that because increasing the batch size hasn’t improved the loss compared to fewer images (1500), and using dropout has increased the loss, I probably need more complexity and I ought to have been using batch normalisation to help convergence.

This is what I will try next, but it will take a few days because training takes several hours with my current image size. Frustratingly, I left two GTX 1070s behind in my last country; the cost of GPUs is silly and my little GTX 1650 cannot do a lot with 4 GB of VRAM.

Thanks for the deep lizard link; that was very helpful. My first thought on watching was that batch norm is obviously important for e.g. ReLU, though I wasn’t so sure about tanh or sigmoid activation functions; after a further bit of digging I found this, which emphasises that batchnorm helps keep the majority of outputs in the linear rather than the saturated regions of tanh/sigmoid, which is good for speed of training etc.
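A quick numerical illustration of that point (made-up pre-activation values):

```python
import numpy as np

raw = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])   # pre-activations with a large spread
normed = (raw - raw.mean()) / raw.std()       # roughly what batch norm does, before its scale/shift

print(np.tanh(raw))     # mostly ±1: saturated, so gradients through tanh are tiny
print(np.tanh(normed))  # spread through the near-linear region, so gradients stay healthy
```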

With all that extra input I would agree that it makes sense to add batchnorm to the dense layers as well.
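i.e. something along these lines (again only a sketch, reusing the sizes from your description):

```python
from tensorflow.keras import layers

dense_block = [
    layers.Dense(128),
    layers.BatchNormalization(),  # batch norm before the activation is the common pattern
    layers.Activation("relu"),    # or tanh/sigmoid - batch norm helps keep either out of saturation
]
# ...repeat for a second Dense(128) block, then Dense(1) for the regression output.
# I'd leave batch norm off that final Dense(1) itself.
```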

Sorry to hear about the 1070s - I hope they went to good homes (or are safe somewhere).

I agree about prices - it’s now possible to buy a gaming PC with GPU and recoup the entire cost by reselling the GPU… or look on it as buying a GPU and getting the rest of the PC free. It’s nuts.

Yes, the 1070s went to good homes, kinda upset that I was forced to do that though.

So I added the batch normalisation and I still calculate a validation loss of 390k. This is bizarre because it was only 150k when I used 1500 images. What I have noticed is that my gradient is looking ugly, so I will add some complexity and see if it clears up. I am also checking for errors in my original images/targets.

Can I just check what the x-axis represents on the graph of loss over one epoch, please? I guess that the numbers 1-10 are the training loss and the numbers 11-14 are the validation loss. In that case, what is the testing portion of the data used for? (I can’t seem to run a post-training test in PL.)

Please don’t feel obliged to keep answering this thread if you don’t want to. I might use it as a journal as I try to improve the model. That might be helpful to other people who try to develop their models.

Hi @JWalker

No worries - I’ve been learning a lot by following up on things you’ve mentioned (so useful learning for me) and now understand why a) batch normalisation is good (sort of) and b) larger batches may be better with batch norm (the larger sample makes each batch’s mean and SD closer to the population mean and SD, which reduces the inter-batch variation).

I’ll post something else tomorrow morning when I’ve checked something, but yes, the model looks screwy to me - mostly because of the enormous range of the gradients.

Thanks Julian,

There is something else: PL reports my loss at the end of my final epoch, but that seems to be different from the reported loss over all epochs. Do you know which one I ought to pay attention to?

For example, the final validation loss reported with 2000 images and a batch size is 381k, but the loss over all epochs was reported as 200k:

No worries James. I was hoping to find a way for you to output the TensorFlow model.summary() for ease of description, but haven’t found a way yet, so if you could simply describe the layers in sequence - dimensions, #features, #neurons, batch norm on/off, activation function, etc. - and the optimisation parameters, that would help me get a feel for it.

I also sometimes have issues with understanding the info provided in PL - see this thread between me and @robertl.

I think what we get right now is pretty amazing for a 0.x release and the PL guys are really eager to make it meet users’ needs, but for that they need feedback (like this) and suggestions (if we have any) - so I have referenced below a related feature request I created a couple of weeks ago.

What would help you in this situation?

My current understanding is that

epoch loss is the average of the batch losses

You said

For example, the final validation loss reported with 2000 images and a batch size is 381k, but the loss over all epochs was reported as 200k:

Looking at the image provided, that description is consistent with reading the right-hand value for validation on the upper loss-in-one-epoch graph: 381k for validation is the point nearly at the 400k line. Again, the loss for that epoch will be averaged over all the validation batches - the wiggly line does make it seem like temporal evolution (which it isn’t; it’s just successive values), hence my feature request here.
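In other words (a trivial sketch of my understanding, not of PL’s actual code):

```python
import numpy as np

# Made-up per-batch validation losses within one epoch - the "wiggly line"
batch_losses = [410e3, 395e3, 370e3, 350e3]

epoch_loss = np.mean(batch_losses)  # what I believe the loss-over-all-epochs plot shows as one point
print(epoch_loss)
```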

Does that help?