My attention was drawn to the City of York Council, who publish their payments to suppliers for 2012 as 'Open Data': a collection of comma-separated files (CSVs) which you can then import into, god forbid, Excel or Google Spreadsheets.
That's very nice, except that the data is split across ten CSV files. They are presumably monthly files (I wonder where the other ones are?). Also, I found that the columns weren't regular, meaning that in some files the Amount was in column 7, and in others column 8. There is also a lot of repetitive data in the spreadsheets, making them quite big to work with. All of this makes it difficult to browse and combine the data. It's almost as if they really don't want you to read and understand it.
So I thought I'd share how I coaxed it into something more useful for understanding the data. The City of York Council are of course free to do something like this if they like; it only takes a few minutes. Or they could pay me a consultancy fee to help them and I might make an appearance in next year's CSVs.
Step One - Download and Combine The Spreadsheets
This is tricky. I found the easiest way to do this was to download the files "by hand" and then write a bit of Python code to merge the data I needed into one CSV file. The code is here...
concatenate_csvs.py
It creates a file called "combined.csv".
You'll notice that I only used three columns of the data and grabbed them by name (using csv.DictReader), which sidesteps the problem of the columns moving around between files. You can change these names to pick different columns if you want.
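The linked script isn't reproduced here, but a minimal sketch of the same approach might look something like this. The three column names are guesses for illustration (check the header row of the real files), and the function name and the encoding default are assumptions too.

```python
import csv

# Hypothetical column names: check the header row of the real files and adjust.
WANTED = ["Supplier", "Expense Type", "Amount"]


def concatenate_csvs(paths, out_path, wanted=WANTED, encoding="utf-8"):
    """Copy the wanted columns from every input CSV into one output CSV.

    Returns the number of data rows written. The encoding is an assumption;
    files exported from Excel may need "cp1252" instead.
    """
    rows_written = 0
    with open(out_path, "w", newline="", encoding=encoding) as out_file:
        writer = csv.DictWriter(out_file, fieldnames=wanted)
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding=encoding) as in_file:
                # DictReader keys each row by header name, so it does not
                # matter whether Amount sits in column 7 or column 8.
                for row in csv.DictReader(in_file):
                    # Missing columns come back as None; write them as blanks.
                    writer.writerow(
                        {name: (row.get(name) or "").strip() for name in wanted}
                    )
                    rows_written += 1
    return rows_written
```

With the downloaded files in a folder called, say, "downloads", calling concatenate_csvs(sorted(glob.glob("downloads/*.csv")), "combined.csv") (after an import glob) would produce the combined file.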
There is also a wonderful tool called Google Refine, which is fantastic for cleaning up slightly duff data. It's often the case that the thing that trips you up is data that turns out to be a bit iffy, and Google Refine helps you to do some very fancy manipulations.
Step Two - Uploading To Google Fusion Tables
I could have uploaded this data into a Google Spreadsheet, but spreadsheets have a limit of 400,000 cells. With 30,000 rows and 10 columns you are already at 300,000 cells, so it's easy to start hitting that limit very quickly.
Google Fusion Tables are designed for bigger data collections that are typically more numerical. They are great at summarising that data easily and quickly, and they can even draw charts of your quick and dirty summarisations. If I'm honest, my abilities in Fusion Tables are poor, but I do seem to be able to muddle through well enough.
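To put rough numbers on that limit: 30,000 rows by 10 columns is 300,000 cells, and 14 columns would take it to 420,000. Here is a quick way to check a file before deciding where to upload it, sketched with hypothetical helper names and taking the 400,000 figure quoted above as given:

```python
import csv

SPREADSHEET_CELL_LIMIT = 400_000  # the limit quoted in the post, taken as given


def count_cells(path, encoding="utf-8"):
    """Return (rows, columns, cells) for a CSV file, header row included.

    Columns is the widest row found, so cells is the size of the grid a
    spreadsheet would need to hold the file.
    """
    rows = 0
    columns = 0
    with open(path, newline="", encoding=encoding) as csv_file:
        for record in csv.reader(csv_file):
            rows += 1
            columns = max(columns, len(record))
    return rows, columns, rows * columns


def fits_in_spreadsheet(path, limit=SPREADSHEET_CELL_LIMIT):
    """True if the file's grid of cells is within the limit."""
    return count_cells(path)[2] <= limit
```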
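Next, upload your data. You can add details about who owns the data and where you got it from along the way. Once it has uploaded and converted, which can take a while, you can browse your data in its raw format.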
Step Three - Summarise Your Data
Then comes the clever bit, which is where you can create a Summary, like this...
... which gives you this.
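If you'd rather sanity-check the totals outside Fusion Tables, the same kind of summary (group by one column, add up another) can be roughed out in Python. This is only a sketch of that idea, not what Fusion Tables does internally; the column names "Expense Type" and "Amount" and both function names are assumptions, and the amount parsing only copes with a pound sign and thousands separators.

```python
import csv
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Turn '£1,234.50' or '1234.5' into a Decimal; return None if unparseable."""
    cleaned = text.replace("£", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject oddities such as 'NaN' that Decimal would otherwise accept.
    return value if value.is_finite() else None


def summarise(path, group_by="Expense Type", amount="Amount", encoding="utf-8"):
    """Total the amount column per group, largest total first."""
    totals = defaultdict(Decimal)  # Decimal() starts each group at 0
    with open(path, newline="", encoding=encoding) as csv_file:
        for row in csv.DictReader(csv_file):
            value = parse_amount(row.get(amount) or "")
            if value is None:
                continue  # skip blank or iffy amounts rather than guess
            totals[row.get(group_by) or "(blank)"] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Looping over summarise("combined.csv")[:10] and printing each name and total would give the ten biggest headings. Rows with amounts it can't parse are silently skipped, so it's worth counting those separately before trusting the totals.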
You can argue amongst yourselves about whether or not the City of York Council have deliberately obfuscated their expenses by providing such crappy files in such an unhelpful way. My usual default is to blame lack of resources, lack of knowledge and general incompetence before corruption. One good thing, though, is that it is getting easier for everyone, even me, to grab the data that's given and get it into a format where I can at least begin to explore it.
So can you. The data is online here.
And Beyond...
The next stage needs to be about making this data, now easily browseable, more communicative. Most of the items in the list raise more (good) questions than they answer.
Why does CYC spend a million quid a year on software licences? Is that value for money, considering they can barely work Excel?
Were CYC really providing open information, this data would be information that questions could be asked of. There'd be links to background information to explain exactly why nearly £2 million was spent on taxis alone (that one always catches the eye). Last year, when I also made a Fusion table of the Council's expenses, @jmalexander1982 happily provided explanations of what the more immediately surprising figures were about.
Of course, the City of York Council might distill some of this information into infographics or charts that better communicated how well our money was being spent. Ideally, these would be interactive, so that we could create our own interpretations and charts of the data and couldn't accuse the Council of spin and manipulation.
Fusion Tables are great for summarising data and revealing the headlines, but less good at surfacing the interesting things at the lower end of the scale: the long tail of payments around £1,000. Ideally I'd like to throw in all the directors of the companies listed in the expenses, see what connections popped out (if any) and make a York-centric TheyRule. Maybe if I can get a startup grant from the council, I'll do that next year.