Scraping The Festival of Ideas, June 2012

I noticed something on Twitter about the University's Festival of Ideas and thought I'd take a look at the events listing. Not long ago, the Web Office used to put microformat information in web pages so that I could easily add events to my calendar... Either they've stopped doing that, or it's stopped working, so I wondered how easy it would be to grab the events listed and add them to my calendar (or a separate one).

In order to do this, I'd need to...

  1. Scrape the HTML from the web page and find the event data
  2. Connect to Google Calendar and add the events found


Because I like programming in Python, the first thing I did was get the latest copy of BeautifulSoup, a library that is unbelievably handy for scraping data out of HTML, and also Google GData, which lets me talk to Google Calendar.

And so I began...


import urllib, urlparse, gdata, time, datetime
from bs4 import BeautifulSoup
import atom
import gdata.calendar
import gdata.calendar.service

... and loaded the libraries.  Then I connected to Google Calendar, like this...


print "Connecting to Google Calendar"
calendar_service = gdata.calendar.service.CalendarService()
calendar_service.email = '*********@york.ac.uk'
calendar_service.password = '**********'
calendar_service.source = 'Google-Calendar_Python_Sample-1.0'
calendar_service.ProgrammaticLogin()



... then got the web page with the Festival of Ideas events on it, like this...


url = 'http://yorkfestivalofideas.com/talks/'
print "reading ", url
u = urllib.urlopen( url )
html = u.read()
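
For anyone following along on Python 3, here is a minimal sketch of the same fetch. The only facts relied on are that `urllib.urlopen` became `urllib.request.urlopen` and `urlparse.urljoin` became `urllib.parse.urljoin`; the helper name `fetch_html`, its timeout and its UTF-8 fallback are my own assumptions, not part of the original script.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_html(page_url, timeout=10):
    """Return the decoded body of page_url (hypothetical helper)."""
    with urlopen(page_url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Relative links found on the listing page are resolved against its URL.
listing_url = "http://yorkfestivalofideas.com/talks/"
event_url = urljoin(listing_url, "/talks/2012/frenck/")
print(event_url)
```

Calling `fetch_html(listing_url)` would then play the role of the `urlopen` and `read` lines above.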

... At this point, I knew I wanted to create a separate calendar, so I made one in Google Calendar ( IMPORTANT! Set the timezone of your newly created calendar!!! ). Once I'd done this, I could then find what's called the calendar link which you use to specify which calendar you want events to go into...


def get_my_calendars_url(cal_name):
    # Walk through the calendars I own and return the link of the one
    # whose title matches cal_name (returns None if nothing matches)
    feed = calendar_service.GetOwnCalendarsFeed()
    for i, a_calendar in enumerate(feed.entry):
        name = a_calendar.title.text
        print i, name, a_calendar.link[0].href
        if name == cal_name:
            return a_calendar.link[0].href


calendar_link = get_my_calendars_url("Festival of Ideas")




So, now I have some HTML with useful information in it and a way of connecting to my chosen calendar... I need to use BeautifulSoup to fish out the data I need. I begin like this...


soup = BeautifulSoup( html )
events = soup.find_all("div", {'class':'event'})


... Now the HTML has been turned into a "soup", which means I can do fancy things with it... like the second line above, where I grab every DIV of class "event" from HTML that looks like this...


<div class="event">
<div class="eventdate">
<div class="day">
Thu
</div>
<div class="date">
14
</div>
<div class="month">
Jun
</div>
</div>
<div class="eventdetails">
<p class="eventtitle">
<a href="/talks/2012/frenck/">
Where it all began: The Big Bang
</a>
</p>
<p class="eventteaser">
Professor Carlos Frenk will open this year's York Festival of Ideas with a talk on the biggest metamorphosis of all - that of the universe as a whole, from the simplicity of the Big Bang to the complexity of the universe of galaxies, stars, and the planet on which we live.
</p>
</div>
<div class="clear"></div>
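
To show what pulling the title and link out of that markup involves, here is a rough sketch that runs without installing anything. It swaps BeautifulSoup for the standard library's `html.parser`, written for Python 3. The class name `EventTitleParser` and the trimmed-down `sample` string are made up for illustration, and the real script uses BeautifulSoup's `find` calls instead.

```python
from html.parser import HTMLParser


class EventTitleParser(HTMLParser):
    """Collect (title, href) pairs from <p class="eventtitle"><a href=...>."""

    def __init__(self):
        super().__init__()
        self.events = []          # finished (title, href) tuples
        self._in_title_p = False  # True while inside <p class="eventtitle">
        self._href = None         # href of the <a> currently being read
        self._text = []           # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and attrs.get("class") == "eventtitle":
            self._in_title_p = True
        elif tag == "a" and self._in_title_p:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.events.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "p":
            self._in_title_p = False


# A cut-down copy of the markup shown above (illustrative only).
sample = """
<div class="event">
<div class="eventdetails">
<p class="eventtitle">
<a href="/talks/2012/frenck/">
Where it all began: The Big Bang
</a>
</p>
</div>
</div>
"""

parser = EventTitleParser()
parser.feed(sample)
print(parser.events)
```

The bookkeeping flags are exactly the sort of thing BeautifulSoup saves you from writing, which is why the script leans on it.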


... Once I've got a list of events I can then do this, which finds the title, the teaser text and the start and end times of each event...



for event in events:
    try:
        title = event.find('p', {'class':'eventtitle'}).find('a').contents[0].strip()
        href = event.find('p', {'class':'eventtitle'}).find('a')['href']
        # Turn the relative link into a full URL
        href = urlparse.urljoin(url, href)

        # Get the actual page in the href!
        u = urllib.urlopen( href )
        event_html = u.read()
        small_soup = BeautifulSoup(event_html)
        # The start time lives in <abbr class="dtstart" title="...">
        start_time = small_soup.find('abbr', {'class':'dtstart'})['title']
        st = time.strptime(start_time, "%Y-%m-%dT%H:%M")
        # No end time is read from the page, so assume two hours past the
        # start hour; timedelta avoids an invalid hour for late starts
        end_dt = datetime.datetime(st.tm_year, st.tm_mon, st.tm_mday, st.tm_hour, 0, 0)
        end_dt = end_dt + datetime.timedelta(hours=2)
        end_time = end_dt.strftime("%Y-%m-%dT%H:%M:%S")
        start_time = start_time + ":00" #HACK UG! (adds the missing seconds)

        teaser = event.find('p', {'class':'eventteaser'}).contents[0].strip()
        teaser = teaser + "\n\n" + href

        print "creating event:", title
        print create_event(title, teaser, "York, UK", start_time, end_time)

        print "_" * 80
    except Exception, err:
        # Report the problem and carry on with the next event
        print err
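
The time handling is the fiddliest part of that loop, so here is a small standalone sketch of it for Python 3. The function name `event_times` and the `duration_hours` parameter are hypothetical, and the two-hour length is only the guess used in the loop above. Unlike the loop, this version keeps the start minutes rather than rounding the end time to the hour.

```python
from datetime import datetime, timedelta


def event_times(dtstart, duration_hours=2):
    """Turn a 'YYYY-MM-DDTHH:MM' start into (start, end) strings with seconds."""
    start = datetime.strptime(dtstart, "%Y-%m-%dT%H:%M")
    # Adding a timedelta rolls over midnight correctly, which
    # building a datetime with hour + 2 would not.
    end = start + timedelta(hours=duration_hours)
    fmt = "%Y-%m-%dT%H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)


print(event_times("2012-06-14T18:30"))
print(event_times("2012-06-14T23:00"))
```

The second call shows a late event ending on the following day instead of raising an error.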


.... and the create_event code, which uses that calendar_link mentioned earlier, is...


def create_event( title='A lovely event', 
    content='Some text about it', 
    where='York, UK', start_time=None, end_time=None):


    event = gdata.calendar.CalendarEventEntry()
    event.title = atom.Title(text=title)
    event.content = atom.Content(text=content)
    
    #time_zone = 'Europe/London'
    #event.timezone = gdata.calendar.data.TimeZoneProperty(value=time_zone)
    event.where.append(gdata.calendar.Where(value_string=where))


    if start_time is None:
      # Use current time for the start_time and have the event last 1 hour
      start_time = time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime())
      end_time = time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime(time.time() + 3600))
    event.when.append(gdata.calendar.When(start_time=start_time, end_time=end_time))


    new_event = calendar_service.InsertEvent(event, calendar_link)


    return new_event



... Putting it all together, I got events that can be displayed in a fairly rubbishy widget ( go to June 2012 to see the events! ) or a calendar that anyone can browse here.

https://www.google.com/calendar/embed?src=york.ac.uk_9d9et5aruukobiaqpgke4n63rk@group.calendar.google.com&ctz=Europe/London&gsessionid=OK









The End Result?


To be honest, presentation isn't Google Calendar's strong point, is it? It's fugly. It's all about the utility though... and I suppose making sure you get to those events.

I guess my point was, and is, that more of this sort of data should be ending up in places where I can use it, i.e. in Google Calendar rather than hiding on a web page somewhere. Maybe this little bit of code will help someone to get their events into a more usable form.



Comments

  1. Looks nice. I have had a quick play with Yahoo Pipes to produce an RSS feed and simple image gallery from the Festival of Ideas URL

    http://pipes.yahoo.com/pipes/pipe.info?_id=22b1be069e439c08a1d4b7fb082e1c4d

    I think with a little bit more work it could create iCal feeds
