Thursday, 23 May 2013

Getting RSS from a site that doesn't offer RSS using Dapper and Yahoo Pipes

I was asked if it was possible to get RSS from a site that doesn't offer RSS.

One site whose content I was interested in was "Community & Networks Connection" - it aggregates lots of "community and collaboration software" news.

Although the site offers RSS feeds, the news in the feed looks like the example below: all of the articles are chunked into daily digests, which forces you to click through to the site and means no individual story ever catches your eye.


Of course it would be possible to screen-scrape the data from the site and republish it as RSS, maybe using a scripting language or the excellent ScraperWiki tool, but I really wanted something that anyone could use... in seconds.
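For comparison, here is roughly what the scripting-language route involves. This is only a minimal sketch using Python's standard library: the sample HTML, the `LinkCollector` class and the `scrape` and `to_rss` helpers are all made up for illustration, and the real site's markup will certainly differ.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

# Hypothetical page fragment; the real site's markup will differ.
SAMPLE_HTML = """
<div class="article"><a href="http://example.com/a1">First headline</a></div>
<div class="article"><a href="http://example.com/a2">Second headline</a></div>
"""


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs from every <a href=...> in the page."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None


def scrape(html):
    """Return a list of (title, url) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.items


def to_rss(items, feed_title, feed_link):
    """Build a bare-bones RSS 2.0 document from (title, url) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped headlines"
    for title, url in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(to_rss(scrape(SAMPLE_HTML), "Scraped feed", "http://example.com/"))
```

Even this toy version needs somewhere to run on a schedule and breaks as soon as the page layout changes, which is why a point-and-click tool is so appealing.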


Dapper To The Rescue

I began by visiting Dapper, a tool that lets you point and click to select which bits of a page you want to scrape. My first step was to click on the images of the news articles at the top.



After a little fiddling, you can choose whether you want that data as RSS, CSV or even as a Google Map. (It really does take some fiddling and pruning to work out what to do here. Dapper is an astonishingly wonderful tool, and I've never seen anything that does what it does with such elegance; it takes a while to get your head around, but it does work once you have.)

I could then have added my new RSS feed to my RSS reader, but I actually made another Dapp that got the articles lower down the page. That left me with two RSS feeds, when I really wanted just one.

One of the "dapps" I created is here:
http://open.dapper.net/dapp-howto-use.php?dappName=CommunitiesandNetworkConnectionDapperVersion2



Yahoo Pipes To The Rescue

Yahoo Pipes is a wonderful visual tool for "piping" together different information sources and republishing the result. The pipe I created (shown below) takes the two RSS feeds from Dapper (at the top), joins them together (Union), strips out any duplicates (Unique) and lastly filters out any junk posts.
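If you find it easier to read code than a pipe diagram, the same three steps can be sketched in a few lines of Python. This is not how Yahoo Pipes works internally, just an illustration of the logic: the feed items are invented, de-duplicating on the `link` field is my assumption, and the "junk" rule (dropping titles containing a blocked word) is a stand-in for whatever filter you configure in your own pipe.

```python
def union(*feeds):
    """Join any number of feeds (lists of item dicts) into one list."""
    merged = []
    for feed in feeds:
        merged.extend(feed)
    return merged


def unique(items, key="link"):
    """Keep only the first item seen for each value of `key`."""
    seen = set()
    result = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            result.append(item)
    return result


def drop_junk(items, blocked_words):
    """Remove items whose title contains any blocked word (case-insensitive)."""
    blocked = [word.lower() for word in blocked_words]
    return [
        item for item in items
        if not any(word in item["title"].lower() for word in blocked)
    ]


# Invented sample data standing in for the two Dapper feeds.
feed_a = [
    {"title": "Community tools roundup", "link": "http://example.com/1"},
    {"title": "Sponsored: buy now", "link": "http://example.com/2"},
]
feed_b = [
    {"title": "Community tools roundup", "link": "http://example.com/1"},
    {"title": "Collaboration news", "link": "http://example.com/3"},
]

combined = drop_junk(unique(union(feed_a, feed_b)), ["sponsored"])

if __name__ == "__main__":
    for item in combined:
        print(item["title"], item["link"])
```

The order matters a little: de-duplicating before filtering means the filter only has to look at each story once.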



The RSS feed that Yahoo Pipes creates is here:
http://pipes.yahoo.com/pipes/pipe.run?_id=10c40fa02b113c58042af74deead0c1a&_render=rss

And it looks a bit like this:


After a few minutes of configuring with point-and-click tools, I can now keep in touch with the news from the site from my news reader.

5 comments:

  1. It probably takes you away from 'something anyone can use' but Yahoo Pipes also allows you to create a feed by fetching data using XPath.

    Thanks for reminding me about Dapper

  2. XPath is a bit of a brain ache... I've only used it by copying and pasting other people's examples. Thanks for that though.

    Also, there's no way to, er, "pipe" input variables to Yahoo Pipes is there? That'd be handy... a sort of http://pipes.yahoo.com/pipe/f6gfge7geu6sdsfd?day=Tuesday

  3. Hi Tom & Martin. Have you had any problems with Dapper misbehaving? I started using it a while ago, but got frustrated with Dapps failing regularly. Tom, re Input variables in Yahoo Pipes, I'm not sure it's exactly what you're after, but the "User inputs" allow you to publish the pipe and then people can enter their own search into it too. Here's an example: http://pipes.yahoo.com/pipes/pipe.info?_id=qLeMq8782xG2oyVwCB2yXQ

  4. I hadn't used Dapper for a LLLOOOOONNNNG time, and re-discovered it last week. I must admit, it did use to have a high quirk factor, but this time seemed to work fine. I haven't tested it though.

    For simple jobs I think it's worth persevering with, especially if the alternative is .... "So, now let me teach you regular expressions or Xpath, and then we can take a look at cron" :-)

    ps. I didn't mean user inputs, I meant URL based inputs... thanks though.

  5. I'll have to give it another go. I did like what it could do when it worked. That and Yahoo Pipes suit me better than other programming methods.

    I see what you mean about the url inputs - yes, that would be very useful.
