Thursday, 9 May 2013

Information Freedom Fighting

My eye caught the City of York Council announcing that they publish all the "Freedom of Information" requests as PDFs ( here ).

The sharp-eyed amongst you will spot that the requests are organised by weeks. Each week's requests are stored in a PDF for that week. Each PDF would need clicking through to that week, then clicking through to that page and then downloading separately (using the handy "Download Now" link ) and then reading. The search engine is pretty hopeless and can't just return FOI requests and so gives you hundreds of results for any query.

Organising FOI requests by week is completely ridiculous, almost as ridiculous as ordering them by the number of words used or alphabetically. Now of course it probably makes sense from the point of view of compiling the requests - it sounds like a "once a week" job for somebody, but to then publish them once a week seems madness.

One of my pet hates is information that is made available but totally impossible to use. It's like saying, "Yes, of course you can have all the data we keep on you, we have written it on mist on these eggshells - would you like us to post it to you?". It's exactly like that.

So, one evening, I wondered if I could retrospectively do something more useful with their data. It is open, so why not.

First I made a crawler with ScraperWiki ( what an excellent tool this is ) that follows all those links and grabs the text from the PDFs. I based this on Martin's Scraper ( thanks Martin ).

I then downloaded the data collected as CSV and tried using the free statistical tool Sci2 for stemming and combining the words. Whilst this approach worked, I didn't like the resulting stemmed words, like glaz, because they just look so unfriendly.

Next, I wrote a python script to strip out stopwords like "the" and "where" etc and count the word popularity of each word in the downloaded file,  keeping track of the URL that it came from, and saved it back to a new .csv file.

from string import *
import  re, HTMLParser
def get_urls(word):
    f = '/Users/tomsmith/Downloads/pdfextractor_1.csv'
    lines = open(f).readlines()
    urls =[]
    for line in lines:    
        url = line.split(",")[0].strip()
        text = lower(line.split(",")[1].strip())
        words = text.split(" ")
        if word in words:
            urls.append( url )
    return urls
class MLStripper(HTMLParser.HTMLParser):
    def __init__(self):
        self.fed = []
    def handle_data(self, d):
    def get_fed_data(self):
        return ''.join(self.fed)
def strip_tags(html):
    #Warning this does all including script and javascript
    x = MLStripper()
    return x.get_fed_data()
def match(s, reg):
    p = re.compile(reg, re.IGNORECASE| re.DOTALL)
    results = p.findall(s)
    return results
f = '/Users/tomsmith/Downloads/pdfextractor_1.csv'
lines = open(f).readlines()
stopwords = open( 'stopwords-en.txt').read().split()
d = {}
words = []
for line in lines:    
    url = line.split(",")[0].strip()
    text = line.split(",")[1].strip()
    words = text.split(" ")
    for word in words:
        word = lower( word )
        word = match(word, "[a-z]*")[0]
        print word
            word = ''
            int( word )
            word = ''
        if word != '' and word != '"\r\n' and word !='"' and len(word) > 1:
            if word not in stopwords:
                print word
                    d[word] += 1
                    d[word] = 1
finalFreq = sorted(d.iteritems(), key=lambda t: t[1], reverse=True)
out = open("tagcloud.csv", 'w')
for item in finalFreq:
    urls = get_urls( item[0] )
    urls_str = "|".join(urls)
    out.write(item[0] + "," + str(item[1]) + "," + urls_str + "\r" ) 
    print item[0], item[1], urls_str

I then uploaded it as a Google Spreadsheet and turned it into a Web Application.

Problems along the way

I found that displaying 1,000 words a bit of struggle for jQuery ( maybe I made it wrong ) so ended just showing 250 at a time.

I found, of course, that the tag clouding the word by popularity only revealed that the most popular words were fairly meaningless in this context like"foi" and "february" and "council". To do this properly you also need a list of contextual stopwords... some human intervention.

Another glitch was that occassionally, a FOI request would comprise of massive tables of suppliers ( in a PDF remember ) which would skew the words to be "B&Q" and "Wickes" or "building materials" etc. I had to avoid those.

The result

And here it is. A list of words that you can browse and find direct links to the PDFs from which it came. In the end, it seems that showing 250ish words is easier on the technology and the eye. The result is something that you could maybe browse and find a link to something relevant to you.

No comments:

Post a Comment