Input or target

Hello,

I really love what has been built! I have a question about loading data with hundreds of features. Is there a way to select every row at once to set as input or target? It seems like it would be a very slow process to manually click each feature one at a time. Am I missing something?

1 Like

Hi @crgone1!

I’m not sure what’s happening but there seems to be a steady stream of new faces around here - the more the merrier!

To answer your question, there are a couple of issues with the data wizard that make working with a large number of features problematic - I believe PL is dealing with them now or soon (I don’t have hundreds, but I have ~20-30 by the time categories are counted).

I agree it would be painful to specify them all by hand… if you have any ideas about how it could work better there’s a feature request forum here and an external list at https://perceptilabs.canny.io/feature-requests. All contributions are very well received by the PL guys!

That said - and it’s just my personal opinion - “hundreds” of features sounds like a very strange or even poorly specified problem… care to say a bit more about what you are aiming for? :slight_smile:

1 Like

Hello,

Well, I have not started to work on what I’m doing next, but I was using data models I built in the past. I spent a lot of time with ECG data. I was using electrical signals, invasive BP data and more to model the depolarization vs. repolarization stages of the heart, to predict outputs that normally require invasive procedures. These models became quite complex. To get the data I had to understand what kind of data to extract during signal processing. Long story short, I have many models based on what I was training, some with different arrangements of features.

Another set of models was for predicting prices in the market. I could not do what I wanted there, because I needed millions of rows and millions of features, and I don’t have the setup to train such a model for testing. Overall, when I think of time-series data, I think of history. History is a must for understanding where the future leads. So when I build models where there would be no data without time, I shift past data into a single instance, and those shifted values become features. Understanding how far back to go in time is important and depends on the problem.
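Roughly, the shifting looks like this - just a toy pandas sketch of the idea, with made-up column names and lag counts:

```python
import pandas as pd

# Toy price series standing in for real time-series data.
df = pd.DataFrame({"close": [1.10, 1.12, 1.11, 1.15, 1.18, 1.17]})

# Shift the previous N observations into extra columns of the current row,
# so each instance carries its own recent history as features.
n_lags = 3
for lag in range(1, n_lags + 1):
    df[f"close_lag_{lag}"] = df["close"].shift(lag)

# The first rows have incomplete history; drop them before training.
df = df.dropna().reset_index(drop=True)
print(df)
```

How many lags to keep is exactly the "how far back to go in time" question.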

Absolutely on overfitting. But when I work with data that is, in principle, a part of chaos theory - random and sensitive to time - I model time into the features. It has served me well for the type of projects I like to do.

Hi @crgone1

The heart modelling sounds very sophisticated. I don’t know how one even begins to do such a thing so I’d love to hear more.

Maybe @robertl will have some thoughts on how models with more features fit in.

1 Like

Hi Robert,

What would you like to know more about? I’m happy to share anything.

Robert did reach out. I will be joining the slack channel as well.

Hi @crgone1 - assuming the opening question was for me, I’d love to know what the literature says about ECG and polarisation mapping.

I’ve included some guesses about this below, but only you can really explain what the objective is and how this modelling will help. But, as the community grows it becomes more and more likely that somebody here knows something that could help you along, so the more you can share the better your chances are.

Of course the PL guys will always be here to assist with technical matters, but there is a growing pool of experience among users…

Anyway. How close would I be with the following guesses on what you are trying to do?

  • You are using more than the standard 12-lead ECG (many sets of standard x12?)
  • The objective is to do something like electrical heart-surface tomography by
  • Using ECG signals and an electrical model to infer the real time electrical wave propagation
  • Hence infer muscle/heart operational details in terms of standard physiological parameters such as cardiac output (CO), cardiac index (CI), stroke volume (SV), systemic vascular resistance (SVR), etc.

as an alternative to ultrasound doppler?

If so, what specifically is guiding the number and placement of electrodes? Is there an electrical heart/body model that would allow the reverse - to go from heart to expected electrode signal - to provide the reference?

I think this is very interesting, especially if the resolution of the surface electrical activity showed propagation differences due to damage from past myocardial infarction and/or arterial obstruction.

I’m sure I read somewhere once about surgery to improve heart performance by selectively cutting into the muscle to design an improved electrical conduction pattern… if you knew where the damage was, corrective surgery could be easier and safer.

On the other hand, I may be completely off target! Over to you!

1 Like

Jumping in quickly here as well on the feature question - it sounds like what you are doing would greatly benefit from a Timeseries or Array datatype (it seems Deep Learning is becoming more popular for processing timeseries data: https://towardsdatascience.com/deep-learning-for-time-series-data-ed410da30798).
We have those lined up in our roadmap, but it will be a little while before they ship.

Until then, one effective approach is to convert the timeseries into images. Using spectrograms is one way to do this, and it also allows you to use convolutional layers with all their benefits.
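For example, something along these lines with scipy (the sampling rate and window sizes here are just placeholders - adjust them to your signal):

```python
import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs = 250                                    # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)  # stand-in signal

# Turn the 1-D time series into a 2-D time-frequency image.
f, seg_t, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)

plt.pcolormesh(seg_t, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.savefig("spectrogram.png")              # image a convolutional model can consume
```

Each saved image can then be fed into an image-style model just like any other picture.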

Looking forward to seeing you in the Slack!

Best,
Robert

1 Like

Well, it’s clear you know a lot about the heart and the ECG! To be clear, this was a project I started and got very far with, but put aside after realizing I could not get more data without spending a lot of money on an Investigational Device Exemption (IDE) study.

The first goal was to convert a lead I ECG (dry electrodes) into real-time, beat-by-beat invasive arterial pressure, cranial pressure, and L/R ventricle and atrium pressures. I started off learning about every stage of the heart and how each stage of the cardiac cycle maps onto the ECG. Then I learned about what affects blood pressure throughout the body. Then fluid dynamics, breathing, and so on.

Most people use a method called pulse transit time to calculate a single BP measurement, but it never works well enough to be considered a class II (FDA) medical device. The first thing was to understand what kind of features to extract and where to get the data. I needed invasive arterial pressure with a multi-lead, wet-electrode ECG recorded at the same time. I ended up getting data for about 100 patients from an old study with that kind of high-quality data, but only about 45 patients’ worth was usable. The datasets included breathing, SpO2, internal heart chamber pressure, and in some cases cranial pressure for each patient. You can imagine this kind of data is hard to get, as they were all ICU patients in critical condition.

Professionals who have used ML for BP measurements always try to do so with a blood pressure cuff and ECG data. I always thought cuffs would never work anyway, because cuffs are not actually that accurate - just enough for spot-based care by FDA standards. I had learned earlier that, in machine learning, you need data that accurately points to what you are trying to predict to make it feasible, especially in the medical field. I wanted real-time, direct measurements of everything.

Signal processing was one of the most valuable steps for making sure the data was clean. The ECG can show breathing, but it’s hard to extract because it’s part of the noise. I learned that professionals in the field use a filter called baseline wander removal to smooth out the ECG signal before analysis. But for me, noise always has importance in any problem. Baseline wander is caused in part by breathing and patient movement, so I figured out a way to keep the important stuff while removing the rest of the noise.
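For anyone curious, baseline wander removal usually boils down to a gentle high-pass filter; a rough sketch (the cutoff and sampling rate here are illustrative, not the exact values I used):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def remove_baseline_wander(ecg, fs, cutoff_hz=0.5):
    """High-pass filter that suppresses slow baseline drift.

    A 0.5 Hz cutoff is a common textbook choice; lowering it keeps more of
    the respiration-related drift if, as above, you want to preserve it.
    """
    b, a = butter(N=2, Wn=cutoff_hz / (fs / 2), btype="highpass")
    return filtfilt(b, a, ecg)  # zero-phase filtering avoids shifting the waves

fs = 500                          # assumed sampling rate in Hz
ecg = np.random.randn(10 * fs)    # stand-in for a real ECG segment
clean = remove_baseline_wander(ecg, fs)
```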

After some basic filtering and cleaning of the signals, the next step was to extract features to represent what was needed. ECG feature extraction involved the P wave, QRS complex, and T wave. I needed the time/location of the start, peak, and end of each wave, then the peak amplitude of each, followed by the area and angle of each wave. I did a lot of hands-on analysis with the datasets, which led to the understanding that the past 2 to 3 heart cycles had a high correlation with beat-by-beat arterial pressure.
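The detection side can be sketched with scipy’s peak finder, though this is a toy version only - real delineation of P/QRS/T onsets, peaks, and offsets needs a proper algorithm (Pan-Tompkins, wavelets, etc.) and carefully tuned thresholds:

```python
import numpy as np
from scipy.signal import find_peaks

fs = 500                            # assumed sampling rate in Hz
ecg = np.random.randn(10 * fs)      # stand-in for a filtered ECG segment

# Locate R peaks (the tallest deflection of the QRS complex).
# distance enforces a refractory period; prominence is an illustrative threshold.
r_peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), prominence=1.0)

# Simple per-beat features: R-R interval (timing) and R amplitude.
rr_intervals = np.diff(r_peaks) / fs    # seconds between consecutive beats
r_amplitudes = ecg[r_peaks]
```

From fiducial points like these you can then compute the per-wave durations, areas, and angles mentioned above.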

So I started to use past-cycle features within the same instance to build the data models to predict a real-time BP waveform. That’s when I learned that shifting time-series data did not just apply to this problem, but to every time-series ML problem I have played with. I only focused on real-time arterial BP at first. Long story short, I got it to where it could predict the entire BP wave within 2-14 mmHg per heart cycle. I had limited training data, and the signal processing was sub-par compared to what I imagined due to limited help, so there were a few cases where it could be even further off. I wanted to learn how to do the invasive procedure on myself to collect more BP data, but found out this would have gotten me into serious trouble - and the data would have been thrown out anyway. I could not get any VC to believe me or take the risk. The medical field can be tricky for a self-taught person with no prior experience. Doctors were impressed, but I needed more data to get them to vouch for me.

Overall, neural networks and/or rule-based models did very well. Training became a problem because the dataset sizes were millions of instances with many features - that’s a lot of CPU/GPU power I did not have. So I simplified the models drastically to get results. For me, it was a success at the very least, but medical professionals needed to see more patient data, which I could not provide. Plus, I did not know how to automate the process with real-time ECG streaming, filtering, extraction, and prediction, due to my skill set. So I stopped, because I had spent way too much time and money to get that far. I heard NO so many times from VCs that I stopped counting. This whole project was the first time I started to learn about machine learning, with no background. It took a lot of time to fully understand how it works and where it should be applied.

If you want, I can share some of the bigger models I made for stock market price predictions. They are only on a daily scale, so they are not as I intended, but they produce decent predictions. It’s a different problem entirely from the above. I wanted to use 15-minute and tick data, but unfortunately the models got so big my computer would crash when opening them, so for now I will never know if they would work as I imagined. Overall, I build these models in my head first before trying them out, as it helps me save time. Now I want to learn about convolutional networks to convert 2D video into 3D by predicting the second perspective. The first step will need depth prediction before the second perspective can be predicted accurately. Upscaling and sharpening of the predicted images will be required as well. I can see how I want to do this, but I don’t know how yet - I don’t even understand how to build these types of models yet. This is mostly why I’m here, and I love PerceptiLabs’ setup and simplicity a lot.

I’m sorry, because this message is way too long and wordy. You don’t have to read it all!

How do I get to the slack page?

Using a convolutional network with time-series data is a great idea. I had thought that would be very powerful, but I had no idea how to do it, and for some reason it felt overwhelming to think about while I was struggling to visualize what I was doing with time-series data. I’m sure all sorts of new patterns would arise if I treated time-series data as images.

Hi @crgone1

There’s a link for Slack at the top of the page. Try that - if it doesn’t work (I can’t tell, I’m already there) just let me know here and I’ll try to send you an invite, if I can: I’m just an enthusiastic user like you so I might not have permissions, in which case I’ll ask @robertl to issue one :slight_smile:

I’ll add a reply to the longer message now; the cool thing for me here is the diversity of ideas, creativity, applications, etc., so I love all these new possibilities and, if I could, I would probably want to do the same sort of things myself / collaborate if I had the time!

Hi again @crgone1 :wink:

I don’t really know much about anything, but I am a very quick study - I just researched the heart, ECG, etc. and put 2 + 2 together. I see I wasn’t too far off - you want to predict the pressures.

You know what I’d do? I’d artificially stimulate a heart - maybe a frog for starters? - and measure your observables, then train a model to predict them from the stimulus magnitude/placement; and then I’d try to train a model to infer the stimulus etc. from the surface ECG readings… sort of a physiological encoder-decoder, though I’d give it less than a 20% chance of working! Not sure the frog would consider it a good use of its heart, though.

The fact that you do have real data is a major plus! I hope the time series are long enough… I wonder if there’s any way to “bootstrap” more samples if you need them for training?

I love making connections, and if not for myself, then for others, so have a look at what @esbenkran is doing… I think there might be fNIRS “EEG” /ECG time series synergy here… or at least, you might benefit from sharing tips & tricks! Stick around and you’ll see a lot of that happening.

The stock market prediction work is also interesting… I have actually done some work in that area (!), though not on the neural network side: what I was doing was developing a chaos detector to feed into a prediction network. It worked (I presented it at an investment conference in Germany a few years ago), but it should be reimplemented for efficiency now that suitable tools for doing the ML side are appearing.

TBH, I’m not a big fan of too much data - “millions of features” - I’d love to know how that can be handled :slight_smile:

The point about it being a long message is fair, but it is really interesting… it makes me wonder about setting up some Zoom calls, maybe with just a few people, to share/present models and other work, where you can more easily show stuff and respond to questions…

What do you think?

1 Like

https://1drv.ms/u/s!Al1wt0aspkn8jYAnqwVwnhFJP84WsQ?e=lHoX4M

This is a OneDrive link to some training models for gold vs. the USD (High, Low, and Open predictions). They are about 400MB each, though. Well, I’ve never trained anything with millions of features or instances, because the files always crash - I can’t even open the documents anymore. I don’t think Excel is meant for this.

The attached models have hundreds of features, though. These are pretty good at predicting price, but to run them they need a real-time price feed and a connection to a doc of news items, each labeled as high, medium, or low impact. A buddy and I got this working in real time, but it was hard to see what was happening once it was complete, because we did not know how to run backtesting with the machine learning code (a must for this sort of thing), so it just had to run live. Overall, I wanted to predict other things on top of this to help the overall system, like when to place trades and when to close them. We just ended up creating some basic logic to trade and close every hour based on the predictions.

It needed a lot of fine-tuning and custom code, so we took a break from it. I needed to make the models even better, but resources were becoming limited; this was on a 1-hour prediction schedule (1-hour candle data), and I wanted to use 15-minute data. I plan on coming back to this in the near future to fulfill the vision.

Just wanted to share this real quick. I will respond to the rest of your message later tonight :)

Thanks @crgone1

Just to see, I downloaded one of those 359MB CSVs… and much to my surprise Excel 2019 opened it…

All 895 columns and 77,920 rows!

Clearly a lot of work has gone into this but I think you might have to write a whole paper to explain the pipeline…

If you want to do more work on this then Python is probably a much better friend than Excel… it’s so much easier to wrangle this with pandas (read_csv, etc.), numpy & matplotlib (or plotly?). And you could have all 3 datasets in memory at once! (And maybe just pickle your dataframes for speed of reload between work sessions?)
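Just as a sketch of what I mean (the filename is made up - swap in one of the real CSVs from your link):

```python
import pandas as pd

# Hypothetical filename - substitute one of the actual OneDrive CSVs.
df = pd.read_csv("gold_usd_high.csv")   # ~895 columns x ~78k rows fits in RAM fine

print(df.shape)
print(df.describe())                    # quick sanity check of the features

# Save the parsed dataframe so later sessions skip the slow CSV parse.
df.to_pickle("gold_usd_high.pkl")
df = pd.read_pickle("gold_usd_high.pkl")    # near-instant reload
```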

There’s also some package that does transparent streaming to/from disk for large dataframes but I can’t remember what it’s called.

I’d hate to try plotting, interacting with all that in Excel!