Tuesday, 24 April 2012

Scraping The Festival of Ideas, June 2012

I noticed something on Twitter about the University's Festival of Ideas and thought I'd take a look at the events listing. Not long ago, the Web Office used to put microformat information in web pages so that I could easily add events to my calendar... Either they've stopped doing that, or it's stopped working, so I wondered how easy it would be to grab the events listed and add them to my calendar (or a separate one).

In order to do this, I'd need to...

  1. Scrape the HTML from the web page and find the event data
  2. Connect to Google Calendar and add the events found

Because I like programming in Python, the first thing I did was go and get the latest copy of BeautifulSoup, a library that is unbelievably handy for scraping data out of HTML, and also Google GData, which lets me talk to Google Calendar.

And so I began...

import urllib, urlparse, gdata, time, datetime
from bs4 import BeautifulSoup
import atom
import gdata.calendar
import gdata.calendar.service

... and loaded the libraries.  Then I connected to Google Calendar, like this...

print "Connecting to Google Calendar"
calendar_service = gdata.calendar.service.CalendarService()
calendar_service.email = '*********@york.ac.uk'
calendar_service.password = '**********'
calendar_service.source = 'Google-Calendar_Python_Sample-1.0'
calendar_service.ProgrammaticLogin()  # log in before making any requests

... then got the web page with the Festival of Ideas events on it, like this...

url = 'http://yorkfestivalofideas.com/talks/'
print "reading ", url
u = urllib.urlopen( url )
html = u.read()

... At this point, I knew I wanted to create a separate calendar, so I made one in Google Calendar ( IMPORTANT! Set the timezone of your newly created calendar!!! ). Once I'd done this, I could then find what's called the calendar link which you use to specify which calendar you want events to go into...

def get_my_calendars_url(cal_name):
    # Walk the calendars I own and return the link of the one named cal_name
    feed = calendar_service.GetOwnCalendarsFeed()
    for i, a_calendar in enumerate(feed.entry):
        name = a_calendar.title.text
        print i, a_calendar.title.text, a_calendar.link[0].href
        if name == cal_name:
            return a_calendar.link[0].href

calendar_link = get_my_calendars_url("Festival of Ideas")

So, now I have some HTML with useful information in it and a way of connecting to my chosen calendar... I need to use BeautifulSoup to fish out the data I need.  I begin like this...

soup = BeautifulSoup( html )
events = soup.find_all("div", {'class':'event'})

... Now the HTML has been turned into a "soup" which means I can do fancy things with it... like the 2nd line above where I grab any DIV that is of class "event" from code that looks like this..

<div class="event">
  <div class="eventdate">
    <div class="day">
    <div class="date">
    <div class="month">
  <div class="eventdetails">
    <p class="eventtitle">
      <a href="/talks/2012/frenck/">
        Where it all began: The Big Bang
    <p class="eventteaser">
      Professor Carlos Frenk will open this year's York Festival of Ideas with a talk on the biggest metamorphosis of all - that of the universe as a whole, from the simplicity of the Big Bang to the complexity of the universe of galaxies, stars, and the planet on which we live.
  <div class="clear"></div>
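
To see how that class-based lookup behaves on markup shaped like the above, here's a minimal sketch. It uses Python 3's standard-library html.parser instead of BeautifulSoup (so it runs without extra installs), and the SAMPLE string, with its closing tags and shortened teaser, is my guess at the page structure rather than a copy of the real HTML. The class name EventTitleParser is made up for illustration.

```python
from html.parser import HTMLParser

# Assumed markup: a trimmed, well-formed guess at one event block.
SAMPLE = """
<div class="event">
  <div class="eventdetails">
    <p class="eventtitle">
      <a href="/talks/2012/frenck/">Where it all began: The Big Bang</a>
    </p>
    <p class="eventteaser">Professor Carlos Frenk will open this year's York Festival of Ideas...</p>
  </div>
  <div class="clear"></div>
</div>
"""


class EventTitleParser(HTMLParser):
    """Collects (title, href) pairs from <p class="eventtitle"><a href=...>."""

    def __init__(self):
        super().__init__()
        self.in_title_p = False   # True while inside <p class="eventtitle">
        self.current_href = None  # href of the <a> being read, if any
        self.text_parts = []      # text chunks seen inside that <a>
        self.events = []          # finished (title, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and "eventtitle" in (attrs.get("class") or "").split():
            self.in_title_p = True
        elif tag == "a" and self.in_title_p:
            self.current_href = attrs.get("href")
            self.text_parts = []

    def handle_data(self, data):
        # Only keep text that sits inside the title link
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.text_parts).strip()
            self.events.append((title, self.current_href))
            self.current_href = None
        elif tag == "p":
            self.in_title_p = False


parser = EventTitleParser()
parser.feed(SAMPLE)
print(parser.events)
```

BeautifulSoup does the same job in a line or two, which is exactly why I reached for it.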

...Once I've got a list of events I can then do this... which finds the title, and the text and the dates and times of the events....

for event in events:
    try:
        title = event.find('p', {'class':'eventtitle'}).find('a').contents[0].strip()
        href = event.find('p', {'class':'eventtitle'}).find('a')['href']
        href = urlparse.urljoin(url, href)

        #Get the actual page in the href!
        u = urllib.urlopen( href )
        event_html = u.read()
        small_soup = BeautifulSoup(event_html)
        start_time = small_soup.find('abbr', {'class':'dtstart'})['title']
        st = time.strptime(start_time, "%Y-%m-%dT%H:%M")
        # No end time on the page, so guess: two hours after the start, on the hour
        end_dt = datetime.datetime(2012, st.tm_mon, st.tm_mday, st.tm_hour+2, 0, 0)
        end_time = end_dt.strftime("%Y-%m-%dT%H:%M:%S")
        start_time = start_time + ":00" #HACK UG!

        teaser = event.find('p', {'class':'eventteaser'}).contents[0].strip()
        teaser =  teaser + "\n\n" + href

        print "creating event:", title
        print create_event(title, teaser, "York, UK", start_time, end_time)

        print "_" * 80
    except Exception, err:
        # Report the problem and carry on with the next event
        print err
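
The end-time arithmetic in that loop (tm_hour+2, with the year hard-coded) would fall over for an event starting at 22:00 or later. As a hedged alternative, here's a small Python 3 sketch using datetime.timedelta. The helper name event_times and the example dates are made up, the two-hour duration is the same guess as above, and unlike the loop it keeps the start minutes rather than rounding the end down to the hour.

```python
from datetime import datetime, timedelta
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

STAMP_IN = "%Y-%m-%dT%H:%M"      # format of the dtstart title attribute
STAMP_OUT = "%Y-%m-%dT%H:%M:%S"  # format handed to create_event


def event_times(dtstart, hours=2):
    """Return (start, end) strings, assuming the event lasts `hours` hours."""
    start = datetime.strptime(dtstart, STAMP_IN)
    end = start + timedelta(hours=hours)  # rolls over midnight safely
    return start.strftime(STAMP_OUT), end.strftime(STAMP_OUT)


# Example timestamps (invented, not taken from the festival pages)
print(event_times("2012-06-14T18:30"))
print(event_times("2012-06-14T23:00"))

# Relative links from the listing resolve against the listing URL
print(urljoin("http://yorkfestivalofideas.com/talks/", "/talks/2012/frenck/"))
```

Because the year comes from the parsed timestamp, this would also keep working for next year's festival without edits.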

.... and the create_event code, which uses that calendar_link mentioned earlier, is...

def create_event( title='A lovely event', 
    content='Some text about it', 
    where='York, UK', start_time=None, end_time=None):

    event = gdata.calendar.CalendarEventEntry()
    event.title = atom.Title(text=title)
    event.content = atom.Content(text=content)
    #time_zone = 'Europe/London'
    #event.timezone = gdata.calendar.data.TimeZoneProperty(value=time_zone)

    if start_time is None:
        # Use current time for the start_time and have the event last 1 hour
        start_time = time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime())
        end_time = time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime(time.time() + 3600))
    event.when.append(gdata.calendar.When(start_time=start_time, end_time=end_time))

    new_event = calendar_service.InsertEvent(event, calendar_link)

    return new_event
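
One thing worth noticing: the fallback branch builds UTC timestamps ending in Z, while the scraped start and end times carry no offset at all, which is presumably why the calendar's own timezone setting matters so much. If you'd rather send explicit UTC times, a minimal Python 3 sketch might look like this. The helper to_utc_stamp is hypothetical, the example time is invented, and the fixed UTC+1 offset assumes every event is in June, when the UK is on British Summer Time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: festival dates fall in June, when UK clocks are on BST (UTC+1).
BST = timezone(timedelta(hours=1))


def to_utc_stamp(local_stamp, tz=BST):
    """Convert an offset-less local timestamp to a UTC string ending in Z."""
    local = datetime.strptime(local_stamp, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")


print(to_utc_stamp("2012-06-14T18:30:00"))
```

I haven't tried this against the calendar itself, so treat it as a starting point rather than a tested fix.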

... Putting it all together, I got events that can be displayed in a fairly rubbishy widget ( go to June 2012 to see the events! ) or in a calendar that anyone can browse here.


The End Result?

To be honest, presentation isn't Google Calendar's strong point, is it? It's fugly. It's all about the utility though... and I suppose making sure you get to those events.

I guess my point was, and is, that more of this sort of data should be ending up in places where I can use it, i.e. in Google Calendar rather than hiding on a web page somewhere. Maybe this little bit of code will help someone to get their events into a more usable form.


  1. Looks nice. I've had a quick play with Yahoo Pipes to produce an RSS feed and a simple image gallery from the Festival of Ideas URL.

     I think with a little more work it could create iCal feeds.