Ability to choose/override the data type or notify user

I noticed that when adding a dataset/CSV, an auto-detection step takes place: if an unsupported image format is found, the image datatype option is removed from the input dropdown.

This became somewhat of an ordeal because, from first launch, I didn't know what was going on: there was no info as to why the option didn't exist, and I could only choose “text”. Ultimately, the problem was that the dataset had been contaminated with .gifs.

Possible solutions…

  • Notify the user of the unsupported format and grey out the datatype option.
  • Allow the user to override the datatype and skip unsupported files.

Hi @RBR,

Thanks for the suggested solutions!
I believe it currently says which formats are supported; it can, however, be made much clearer.

We’ll look at adding something along the lines of your first suggestion in the near future, or at least improve on the current message :slight_smile:

Yes, it does state what's supported. The thing is, I just assumed my dataset was clean, when it actually had 4 gifs among 90k files, and that's where it became a problem: I didn't know I had them, and nothing told me. I ended up doing a lot of trial and error just to figure out what was going on; a dialogue box stating “something doesn't quite belong here” would have saved me quite a bit of time.

Nice observations @RBR - I think there have been similar requests for insight into data quality/exceptions. It can be hard - especially in your case, with 90k files - to find exceptions, because how do you easily search for, e.g., “NOT JPEG”?
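
For what it's worth, one quick way to answer “NOT JPEG” outside the tool is a short script over the dataset directory (a minimal sketch; the `dataset/` path and the allowed-extension set are assumptions, adjust to your data):

```python
# Minimal sketch: walk a dataset directory and report any file whose
# extension is not in the set we expect. Path and allowed set are assumptions.
from pathlib import Path

ALLOWED = {".jpg", ".jpeg", ".png"}  # assumed supported formats

def find_exceptions(root):
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() not in ALLOWED:
            yield path

for bad in find_exceptions("dataset/"):
    print(bad)
```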

Going back to your image classification task - is identifying NSFW the sort of thing where False Negatives are more of a problem than False Positives? Just wondering how much being able to weight things would be helpful in this kind of task… in medical classification, a False Negative on disease recognition should probably be penalised heavily: letting more positives through to human review is usually the way to go in that case.

@JulianSMoore
Well, I ended up in this position because most of the scripts I wrote for formatting the images, and the trainer, worked on a single extension. When I came around to try PerceptiLabs, I wrote a script for making a CSV, and that's how they got in: it didn't scrutinize file types. I'm not sure how they originally got into the directories, but I know how I got here.
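
For anyone in the same boat, an extension check at CSV-build time would have caught them (a minimal sketch, not PerceptiLabs' required schema; the column names and allowed set are assumptions):

```python
# Minimal sketch: build an image/label CSV, skipping (and reporting)
# anything that isn't an allowed image type. Column names are assumptions.
import csv
from pathlib import Path

ALLOWED = {".jpg", ".jpeg", ".png"}

def build_csv(root, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "label"])
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            if path.suffix.lower() not in ALLOWED:
                print(f"skipping unsupported file: {path}")
                continue
            # assume the parent directory name is the class label
            writer.writerow([str(path), path.parent.name])

build_csv("dataset/", "dataset.csv")
```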

The 90k is just a test bed to see what will and won't work, and to play around with and see how PerceptiLabs works. From some reading on how similar prebuilts were made, 600k seems to be the average and something to aim for, but I'm not gonna wait a month just to test something until I know everything is going in the right direction.

I'm still in the process of just getting PL off the ground with my own data, so I haven't gotten far enough to properly answer you; I'm still learning here. But from playing with prebuilts, both are a problem, and for the one that did document its weighting, the weighting does help. It will never be 100%, but I can get close.

The problem I'm trying to solve is a volume one, not an automation one. The bot is just meant to be a set of eyes that sifts content and says “hey, look at this”, not make moderation decisions, so I can afford a large margin of error with no penalties; all it has to do is be more right than wrong for now. I also try to get the predictor to deal in absolutes: in order to throw up a flag, NSFW has to be greater than a neutral class and hit above an 80% threshold (adjustable). That may or may not be a good approach, but it made sense at the time to have it not give out maybes/halfsies.
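
For a concrete picture of that rule (a minimal sketch; the class names and probability layout are assumptions, the threshold is the adjustable 80% mentioned above):

```python
# Minimal sketch of the flagging rule described above: flag only if the
# NSFW score beats the neutral class AND clears an adjustable threshold.
def should_flag(probs, threshold=0.80):
    return probs["nsfw"] > probs["neutral"] and probs["nsfw"] >= threshold

# a confident prediction gets flagged...
assert should_flag({"nsfw": 0.91, "neutral": 0.07})
# ...while a "maybe/halfsie" does not
assert not should_flag({"nsfw": 0.55, "neutral": 0.45})
```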

@RBR
Using a threshold on how confident the model is in its answer sounds like a good approach :slight_smile:
I believe a ROC curve does something similar: it lets you see what effect different thresholds have on your model's predictions.
You might also want to look at the Recall of your model besides the Accuracy, as a measure of how many NSFW images it actually finds (although it does not take false positives into account).
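
If you want to play with that, scikit-learn can compute both (a minimal sketch; `y_true` and `y_score` are placeholder data standing in for your real labels and NSFW scores):

```python
# Minimal sketch: inspect threshold trade-offs with a ROC curve and
# compute recall at a chosen threshold. y_true/y_score are placeholders.
from sklearn.metrics import roc_curve, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1]                        # 1 = NSFW
y_score = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.70]  # model's NSFW scores

# one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  TPR={t:.2f}  FPR={f:.2f}")

# recall at the 0.80 threshold mentioned above
y_pred = [int(s >= 0.80) for s in y_score]
print("recall:", recall_score(y_true, y_pred))
```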

This feature might be of interest for you as well: https://perceptilabs.canny.io/feature-requests/p/dont-stop-the-data-loading-in-case-of-errors

@RBR

I'm not sure how they originally got into the directories, but I know how I got here

oh yeah, been there! Reminds me of a wonderful bug in a piece of cleansing code I wrote once: try debugging the main code when you’ve already persuaded yourself the data is necessarily good!

On the upside - that does in fact show a side benefit of driving PL with CSV (maybe something for a tips-n-tricks section hereabouts?): by using Excel autofilter, one can very quickly scan columns for anomalous entries.
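
For those who'd rather stay in Python, pandas can do the same quick scan (a minimal sketch; the CSV path and the `image` column name are assumptions):

```python
# Minimal sketch: the pandas equivalent of autofiltering a column,
# here counting file extensions in the image-path column.
import pandas as pd

df = pd.read_csv("dataset.csv")
extensions = df["image"].str.rsplit(".", n=1).str[-1].str.lower()
print(extensions.value_counts())  # a stray "gif" stands out immediately
```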

NSFW has to be greater than a neutral class and hit above an 80% threshold (adjustable)

Got it; use the model to give you a heads-up - makes sense :slight_smile:

Let us know how it goes - I think it’s an interestingly challenging classification task and what you learn by doing could save someone else a lot of effort.