The Icelandic government recently opened a new website: opingogn.is (can be translated as opendata.is). This actually just points to another site (as new but not as revealing name): gogn.island.is (translated: data.iceland.is). The website runs on CKAN from the Open Knowledge Foundation (my new workplace) and its purpose is to open up government data. This is an awesome undertaking of the Icelandic government. Congratulations!
The first three datasets on this new data portal were financial data for the year 2012:
Well this was just perfect for me since in my work for the Open Knowledge Foundation I'll be working on OpenSpending, a project that analyses financial data and helps users present it beautifully. One of the key elements for a successful open source software project is developers who are also users instead of developers and users being two distinct groups. From this we get the phrase: scratching a personal itch. So I decided to become a user of OpenSpending, the software I'm supposed to be a core developer for, by uploading these three datasets.
I also decided to document the upload process so other could follow it. Here it goes...
First thing to notice is the map on Openspending's frontpage. It shows all of the countries and paints countries with published datasets in green. The more intense the color the more datasets have been published. So there are more datsets published from the United States than there are from Canada.
Another thing to notice is the white color of Iceland. In the United States kids learn to say:
"Iceland is green, Greenland is ice."
Well, in this case it's wrong. Both of them look like they're covered with ice. Let's change that.
First I clicked on datasets in the menu bar and got a list of existing published datasets and a button that invited me to Import a dataset.
Clicking on this Import a dataset button told me I had to register before I could continue, so I registered on OpenSpending.
With my spiffy new user account I went back to the dataset list, clicked the button again and got this nice form I could fill in. First up: title. Here, with my internal knowledge of OpenSpending I knew I could add many sources to a dataset. The newly released datasets were tied to 2012. I therefore decided to only create a dataset that wasn't bound to any year. When the Icelandic government releases more financial data I can just add them to the existing datasets.
So working with the quarterly reports for 2012 I decided to name my dataset Gotvernment financial reports (uppgjör ríkissjóðs). The dataset will therefore be time neutral (and the time w.
Next up identifier. What's that? It must be how we want the dataset to be referenced, what I've always known as a slug. So I fill in my homemade slug (uppgjor_rikissjods).
Category was next. Here I could choose between expenditure, budget, or other. Since this is actual money being distributed from minstries to agencies I decided to put this under expenditure.
The remaining fields were pretty self-explanatory. Iceland is the country, Icelandic is the language and description is a short text about the dataset.
After creating the dataset I got a screen where I was invited to add a source which I did. So this is where I would put all of the four sources (for the four quarters of 2012) and OpenSpending would then analyse them all together.
I clicked add source which asks for a URL to a CSV file. I went back to opingogn.is and found the first quarterly report. I had to take care to give the URL to the actual CSV file and not some intermediate page.
It's pretty easy in CKAN. When you click on a button to view the file it gives you two options: View the file (the intermediate page) or download the file (the CSV file). On the page where you can view the file there's also a download button (pointing to the same CSV file).
So I got that URL and fed it into OpenSpending.
After providing the CSV file OpenSpending invited me to create a model. If I hadn't gotten a good introduction when I went to Cambridge on the job interview I would have been pretty lost at this stage. I would have had to read up on and understand how OpenSpending stores data.
It's not complicated but understanding dimensions and measures and whatnot might be too much for people who just want to get data into OpenSpending.
Basically, dimension is something that provides context to the data while measures are the numerical value we're presenting. Dimensions can either be attributes (a dimension that's similar to a measure, just a single value) or a compound (a mix of more than one attributes).
In OpenSpending there are two special types of measures: Date and Amount. Both of these are required. No other information is required by OpenSpending (since they just provide context to money (amount) being spent at a specific time (date).
When creating the model I noticed a problem with the data. There was no time column and since there is no time column I wasn't able to create one of two mandatory fields for the model. This would require some data wrangling and that's not what OpenSpending is for.
So I hacked together a script to wrangle the data into what I wanted (this is the E and T part of ETL). I ended up having two script for these three datasets but I'm only focusing on one of the datasets. I've created a repository for these scripts on Gitorious and I push them to GitHub as well to be super-social. Not that you can't be social on Gitorious, (I like Gitorious more than GitHub) but it's easy to push to two repositories so I can be where both types of developers are.
After downloading and passing the dataset into my script I had a new file, a modified version of the released dataset more suitable for OpenSpending. Now I just had to figure out how to get it into OpenSpending.
One of the first ideas I had was to use Pagekite to make the files on my computer accessible. I've previously used that successfully instead of USB sticks or other methods of media delivery. I love Pagekite but I was afraid that making the files temporarily available wouldn't be good for OpenSpending. I felt having a more permanent storage for the files would be better.
I had heard about The Data Hub while I was in Cambridge. A place that's supposed to be the GitHub for data. I therefore decided to use that to store my modified data files.
Putting the data up on the Data Hub really felt like an extra step. I had to fill out much of the same information as I had done on OpenSpending: Title, Description (where I put the license clause again). I even had to make the same decision again with the slug (this time I almost bound it to a specific year but decided against it at the nick of time).
One thing I liked about the Data Hub was that I was able to make my dataset a part of a group. I was able to add my dataset many other similar datasets. Afterwards I found out that this doesn't make it any more discoverable. First I would have to find the group and then I'd have to find my dataset. It's a feature that looks nice but I don't think I'll be using it in the future (since I'll most likely only use the Data Hub as a permanent storage and not as a social platform).
Alright. After uploading the four modified quarterly reports to the Data Hub I tried importing the data again. Now, the perfectionist me didn't like having a lingering unusable source in my dataset so I deleted the whole dataset and created a new one. This is perhaps pretty radical and not what normal people would do. However I just want things I'm maintaining to be cruft-free.
I used the same information as before but used the opportunity to modifiy my slug (the identifier) to use a dash, not an underscore, to represent the whitespace. Not a huge change but I like the dash more than the underscore in URLs.
After creating the new, cruft-free dataset I went straight to uploading the four sources I had for the quarterly reports. Now to get the correct URLs I had to go again through some hoops.
On the Data Hub page for my dataset I had to choose the resource I wanted to upload. Now this isn't as easy as the CKAN version on opingogn.is (which is running CKAN version 2 but the Data Hub is running CKAN version 1.8).
I had to first click on the resource to get an intermediary page. On that page there are two ways I know of to get the data url. Neither of them is really obvious. The more obvious way is to just copy the URL which is provided in the additional information section (the metadata table). The other way (which should be obvious and probably is for some) is to right click on the button that says Download and the copy link. It's not entirely obvious that the button is in fact a link but when you know (or guess) it's easier than scrolling down, highlighting the url and copying it.
With that URL copied I pasted it into the source field of OpenSpending's Create new source (yes, weird to call it that when you're just adding it, but the button to trigger it says Add a Source so it isn't that much of a biggie).
Now when I went to the model editor and clicked on Add Dimension I got lots of interesting fields (columns) to add as dimensions.
This is a relatively straightforward model editor but I can't help being irritated by how wrongly it tries to guess what field to map in the model (it probably isn't guessing just taking the first option). For example when I wanted to add the amount measurement it suggested I use ministry as the field. It's not that difficult to change or map the correct field but I got this feeling that the model editor was one of those childhood know-it-alls who pretended they knew what to do when they actually didn't (and if you'd trust them you'd end up in a worse position than if you wouldn't have).
That's not to say the model editor is bad. It's good and pretty simple to use. It's probably just that after taking all of those previous steps to get back here I was slightly more negative than I should have been.
I ended up creating four dimensions/measures. The required amount and time (which got labeled upphæð (amount) and lokadagsetning ársfjórðungs (last date of the quarter) respectively. I also created a from and a to field. From was the ministry (ráðuneyti) giving out the money and to was the agency (stofnun) receiving the money.
For unique key I used three dimensions:
If I would uncovers some juicy spending irregularities that would bring the Icelandic government to its knees I'd call the four dimensions the four horsemen.
With my model and my modified data sources I was ready to test a sample. I gathered that it would run a few of the lines against the model to verify everything works as it should.
I click on the test a sample button and wait with crossed fingers. Nothing happens. Oh, yeah I need to refresh the page. Refresh. After running through a sample of the file and testing it against my awesome four dimensionsal model I got a success.
In a cloud of joy with no errors from my sample test I was certain I was now ready to load the actual data into OpenSpending and get the analysis starting.
I click on Load and double cross my fingers.
After a short time of loading I felt my data would be ready so I refreshed. Alas! There were errors. Something had gone horribly, horribly wrong. All my work was gone, down the drain. It was back to the drawing board for me.
I looked at the errors I got. The message was:
"The number of entries loaded is smaller than the number of source rows read."
The error itself was:
"440 rows were read, but only 0 entries created. Check the unique key criteria, entries seem to overlap."
What?! My assumption that the data shows how much money a given ministry sent to a given agency in a given month seemed to be wrong. This was perhaps transactional data, not consolidated transactions. It felt a bit weird since all government transactions for all ministries to a all agencies are definitely more than the 440 rows in the data.
I trust the government more to do a good job than me so there must be something wrong with my script. To investigate I wanted to see which unique keys had clashed. I hacked together a small script than ran through my modified data files and printed out all unique key clashes. I ran the script and got no clashes. What?!
OpenSpending has problems with the unique key criteria but there are no clashes. What does OpenSpending mean? How does it evaluate the unique keys. Am I missing something? I deleted the loaded data and decided to go and sleep on this.
The next day I sat down again. How on earth was I supposed to get the data into OpenSpending when I'm getting an error I can't even begin to imagine how to solve. Sleep didn't help me find the magical solution. All this work for nothing. Well I couldn't just give up.
For some odd reason I decided to click on the load button again. I don't know why. Maybe I did it because I secretly hoped computers weren't logical things and were able to fix themselves. Maybe I did it just to increase my unhappiness by getting more errors as some sort of a self-inflicting data loading pain. I think I did it just to be absolutely sure I had clicked the right button the day before.
Whatever the reason was I clicked Load. Stared into the abyss for a moment. Refreshed. It worked. Wait, what?! It worked?! Computers are magical beings that can heal themselves! I now had all of the data loaded into OpenSpending, ready to be made public to the world. Here is how the Icelandic government spends our money!
I later found out what had happened and why I got an error. Computers aren't magical self-healing objects. They just do what they're told. When testing a sample OpenSpending actually loads the first thousand lines into the database. You're not testing the data you're actually loading it. In the case of the data I had (with the 440 lines) I had actually loaded all of the data into OpenSpending. That meant that when I had pressed Load every single line clashed on unique key criteria because every single line was already in the database. When I had deleted the loaded data before going to sleep I had removed those lines and made pressing Load the next day possible.
This is not what I had anticipated at the time. It's just weird behaviour, although I'm glad it worked out for me. I just wonder how many data loaders have hit this problem before and given up.
A bit uncertain whether things had actually worked or if OpenSpending was just making fun of me I followed Nike's advice and decided to Just Do It! (I'm probably breaking trademark laws right now).
I pushed the button to make the dataset public (the one between Delete loaded data and Delete entire dataset - don't push the wrong button).
My dataset was now live and available to everybody.
Let's take a look at that that dataset. It is live on openspending.org/uppgjor-rikissjods and everything gets presented beautifully in a nice table. It's not perfect. My date dimension that I added as the last date of the quarter (and is labeled as such) only shows the year (not the month). I can however filter by the last date of the quarter so it appears to be only a presentation issue (and after checking, I know it is).
Let's create a visualisation from this data. Let's see what ministry is paying the most amount of money.
We can even make it slightly more complex and show it from the agency perspective (what agency gets the most money). This isn't as good a visualisation because the sheer number of agencies makes this nothing more than a bunch of differently colored boxes apart from a few big receivers. Maybe some sort of a Pareto based visualisation might be interesting. Where does about 80% of the money go? Does it go to only 20% of the agencies? How far from the Pareto principle are we?
This was my journey. I'm now a proud user of OpenSpending (and I know of a few places where I can and have fixed some usability issues).
I've fulfilled the phrase from the United States:
"Iceland is green, Greenland is ice."
Well it's true until Greenland decides to release their financial data and somebody puts it into OpenSpending (hopefully sooner than later). Iceland is now at least a green blip on the screen of visitors to OpenSpending.
I can't wait until the Icelandic government releases more data!