In this blog series I will document my work on the Kaggle ACEA Smart Water Analytics Competition. The purpose of these blogs is not to share perfect code or completely worked-out solutions. Rather, I want to take you along for the ride and share my insights as they are. This blog series was initially shared on my personal website, www.codexvalue.com. Below are the links to the 4 distinct steps in this project:
- Challenge goals and initial discussion
- Data setup for modelling
- Data modelling
- Refinements and final insights
If you want to follow along with any code in this blog series, sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview. You can download the dataset for free after signing up.
The challenge: ACEA Smart Water Analytics Competition
The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model that accurately predicts tomorrow's water level based on today's data. This is an analytics challenge, which means that creating a compelling story and notebook is a very important part. My notebook is publicly available on Kaggle here. I will work through some code excerpts and interesting highlights in these blogs.
So far we have discussed the challenge more generally and looked at some data wrangling and new features for modelling. Last time we overcame a classic issue in time series modelling concerning our cross-validation. This week we will finish off our modelling and wrap up the challenge for hand-in.
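To make the "tomorrow from today" framing concrete, here is a minimal Python sketch of how daily readings can be turned into (today's features, tomorrow's level) pairs. It is an illustration only and not code from the notebook: the function name make_next_day_pairs, the keys "level" and "rain_mm", and the toy values are all made up for this example.

```python
from datetime import date, timedelta


def make_next_day_pairs(readings, horizon_days=1):
    """Pair each day's measurements with the water level `horizon_days` later.

    `readings` maps a date to a dict of measurements that contains a
    "level" key (a hypothetical column name). Days whose target date is
    missing are skipped rather than imputed.
    """
    pairs = []
    for day in sorted(readings):
        target_day = day + timedelta(days=horizon_days)
        target = readings.get(target_day)
        if target is not None and target["level"] is not None:
            pairs.append((day, readings[day], target["level"]))
    return pairs


# Toy data (invented values): three consecutive days, then a one-day gap.
readings = {
    date(2020, 1, 1): {"level": -30.1, "rain_mm": 0.0},
    date(2020, 1, 2): {"level": -30.0, "rain_mm": 4.2},
    date(2020, 1, 3): {"level": -29.8, "rain_mm": 1.0},
    date(2020, 1, 5): {"level": -29.9, "rain_mm": 0.0},
}

pairs = make_next_day_pairs(readings)
for day, features, target in pairs:
    print(day, features["rain_mm"], "->", target)
```

Only days that have a reading on the following day produce a pair, so the gap after 3 January drops out. The horizon_days parameter is there to show that the same framing extends to predictions further ahead.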
ACEA Smart Water Analytics Competition Final model
The last blog was a more in-depth discussion of hindsight bias in the cross-validation stage of the modelling. By using a specific time series method we stabilized our model. In the week after, I stumbled upon the work of Hanna Meyer, a machine learning expert in the field of environmental data. A great research paper she wrote is available here (https://www.sciencedirect.com/science/article/abs/pii/S1364815217310976?via%3Dihub).
Studying her work made me realise that the Kaggle dataset has all the characteristics she discusses in her paper. She covers in depth both the time series aspect covered by Hyndman and the spatial element of the dataset, which in this case means having multiple measurement points that measure the water level. The main advantage of setting up the model in this way is that it is validated against locations it has not seen, so it copes better with unknown locations within the given water body. On top of that, it allows for simultaneous modelling of all given locations.
Leave Location and Time Out (LLTO) modelling
So in the final model I introduce both a spatial element and a time series element. Methodologically this is called LLTO (Leave Location and Time Out) cross-validation. Essentially the idea combines all the steps discussed earlier.
In your training set you leave out the location that is in your validation dataset, which covers the spatial element. You then only include time points from before the time series in the validation dataset, which covers the time element. If you have 4 locations you have 4 folds per time period, and each time one of the locations is placed in the validation dataset. This method is implemented in the CAST package in R. However, I found that the time aspect is actually not handled properly in the relevant function (CreateSpacetimeFolds). Hence I ended up making my own folds that respect both aspects.
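The fold logic described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the R function from the notebook and not the CAST implementation: the function name llto_folds and the keys "location" and "period" are hypothetical, and each row is assumed to already carry a time period label that can be ordered.

```python
def llto_folds(rows, location_key="location", period_key="period"):
    """Build Leave-Location-and-Time-Out folds (illustrative sketch).

    For every (period, location) combination present in the data:
      validation = rows of that location within that period
      training   = rows of all OTHER locations from STRICTLY EARLIER periods
    Combinations without any training rows (the first period) are skipped.
    Returns a list of (train_indices, validation_indices) tuples.
    """
    periods = sorted({row[period_key] for row in rows})
    locations = sorted({row[location_key] for row in rows})
    folds = []
    for period in periods:
        for location in locations:
            validation = [
                i for i, row in enumerate(rows)
                if row[period_key] == period and row[location_key] == location
            ]
            training = [
                i for i, row in enumerate(rows)
                if row[period_key] < period and row[location_key] != location
            ]
            if validation and training:
                folds.append((training, validation))
    return folds


# Toy data (invented): 4 locations observed in 3 consecutive periods.
rows = [
    {"location": loc, "period": period, "level": 0.0}
    for period in (1, 2, 3)
    for loc in ("A", "B", "C", "D")
]

folds = llto_folds(rows)
print(len(folds))
for training, validation in folds:
    print("train:", training, "validate:", validation)
```

With this toy data the first period has no earlier data to train on, so it yields no folds. Each later period yields one fold per location, which matches the "4 locations, 4 folds per time period" rule. Index lists like these are the kind of input that cross-validation tooling typically accepts as custom folds.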
In the code below there is an example function that handles all these steps. Of particular interest may be lines 47–101, where the handmade folds for the cross-validation are created. I have not perfected this code, but it shows the main steps. If you want to know more about this or discuss it, don't hesitate to contact me. We might get back to that later in a different blog.
ACEA Smart Water Analytics Competition; final thoughts
In doing this challenge I've ended up putting a lot of focus on the modelling stage. I did learn a lot going through all these steps on different data than I usually work on. When I look over the final model I am happy with the outcome of the project. The main reason is that it can generalize well to new locations of water level measurement points and is robustly designed for time series effects. Overall I feel that this harks back to the original spirit of the challenge.
There are some improvements to be made. For example, the model failed on some datasets due to a lack of usable data. I could fix this by looking at imputation methods, an area I have skipped completely during the project. I might revisit that later as needed, as the dataset provided in the ACEA Smart Water Analytics Competition contains a lot of missing data points.
Furthermore, the true applicability of this model is still to be determined. I made some assumptions, namely that we want to predict next-day measurements, whereas ACEA might be interested in predictions multiple steps ahead. The model is also more generalized, which makes it easier to apply to new and unknown datasets as well as new locations. But from a business standpoint it can also be logical to focus on optimizing currently known locations more thoroughly. It definitely interests me how subtle changes in the data setup and modelling approach can lead to completely different use cases from a business perspective.
It's been fun, and I will be back writing about the winning notebook, comparing it to my own findings.