RSS 2.0 Import

This is live blogging of boring things. Suffer with me as I spend my Saturday night working on something that I’ve longed to have finished for some time. A waste of bits and bandwidth, and dear reader, a waste of your precious time, but I must, must get this completed.

Let this be a record of me doing something long after it is exciting or amusing. This is work. Unpaid work. Whee!

First things first, define the problem. The core problem is I need more content for Think New Orleans. My tagging and geocoding interfaces are far along, but I don’t have a generic import.

Articles in Think New Orleans are stored as Atom 1.0. They are indexed in a MySQL database. I’ve created an RSS 2.0 to Atom 1.0 conversion pipeline in Relay. This is an XSLT 2.0 transform, and a special pipeline that will parse the contents of the RSS 2.0 description and content:encoded elements to XHTML, so that the RSS 2.0 feed will be an Atom 1.0 feed. If you are interested in that then you can fill out this form and get a conversion of a WordPress feed.

I’m importing WordPress RSS 2.0 to start. Why? Because Maitri runs WordPress, and her blog is an excellent post-Katrina, New Orleans resource.

Not all RSS 2.0 feeds are created equal. Some are quite crufty. WordPress does a good job however.

An important part of Maitri’s blog is the comments. There is a lot of great information there. The RSS 2.0 feed has a special element that references another RSS 2.0 feed for the comments of a blog entry. Pity, then, that the RSS 2.0 comments feeds are 404s at vatul.net. That is a setback.

In any case, here I am, needing to pull in the comments, whether they work or not. That is, I’m sure they’ll be working if I beseech the blogger in question. So, here goes.

(Is this literate programming? Something I’ve been meaning to blog about.)

Feeds

Gave it some thought. Let’s pick on me for a change. Let’s say I have two blogs, one specifically about New Orleans, and one a simpering plea for attention. How do I go about representing these blogs in Think New Orleans?

Well, as noted before, there are People, Circles, and Syndicated Feeds in Think New Orleans. But here’s an annoyance. How does one represent a syndicated feed for a person?

The following is the URI for the tags I generate for the tag uptown.

http://thinknola.com/alan/tags/uptown/index.html

Following the established pattern for importing feeds, the taggable import feed for this blog would look something like this.

http://thinknola.com/taggable/feeds/engrm.com/blogometer/index.html

And I like it. I’d rather have an administrator add feeds to the system than have people assign whatever feed suits them. In this way a feed is authoritative, with no duplicates. In the future, I might have to pass the URI of the source feed as a parameter, but for now, I want them to be readable.

My tables for feeds look like this.

CREATE TABLE feeds (
    feed_id    integer auto_increment,
    guid       text,
    url        text,
    path       varchar(255),
    alternate  varchar(255),
    PRIMARY KEY (feed_id)
)
\g

Thus, I’ll insert the engrm.com blog using the following.

INSERT INTO feeds(guid, url, path, alternate)
VALUES('tag:thinknola.com,2005:feed:convert:/engrm.com/blogometer/',
       'http://engrm.com/blogometer/feed/',
       'engrm.com/blogometer',
       'http://engrm.com/blogometer/')
\g
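
A rough sketch of how the values in that INSERT relate to the readable path. The guid, alternate, and taggable URI patterns here are inferred from the single engrm.com example above, so treat them as assumptions, and `feed_record` and `taggable_uri` are hypothetical helper names.

```python
def taggable_uri(path: str) -> str:
    """Readable Think New Orleans URI for an imported feed (assumed pattern)."""
    return f"http://thinknola.com/taggable/feeds/{path.strip('/')}/index.html"


def feed_record(path: str, url: str) -> dict:
    """Build the column values for one row of the feeds table.

    Every pattern below is generalized from one example row, not from
    a documented rule.
    """
    path = path.strip("/")
    return {
        "guid": f"tag:thinknola.com,2005:feed:convert:/{path}/",
        "url": url,
        "path": path,
        # Assumes the blog's home page is simply the path served over HTTP.
        "alternate": f"http://{path}/",
    }


record = feed_record("engrm.com/blogometer", "http://engrm.com/blogometer/feed/")
print(record["guid"])
print(taggable_uri(record["path"]))
```

Keeping the path as the one hand-written value is what keeps the URIs readable; everything else falls out of it.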

I’m not much of a stickler for normalization these days. Used to be all OCD about it. In fact, why even use MySQL? Why not use Lucene and only Lucene?

Good question. I’m glad I asked.

Because the joins will be interesting. It’s how the tagging is done. Maybe you can do that in Lucene just as easily, but I’m pretty good with a select statement.

Anyway, I’m going to have to add a field to the feeds table. It will be a feed type. It will be a simple text field. And I’ll add RSS 2.0 for the value for my blog since that’s the feed flavor.

Okay, here’s the DDL.

ALTER TABLE feeds
ADD COLUMN type VARCHAR(32) NOT NULL DEFAULT 'RSS 2.0'
\g

Tributaries

Let’s call the Atom 1.0 feeds for comments tributaries. One to many. No hierarchy as of yet. I’m not going to search the comments feeds for comments feeds, and threading is still a ways out.

CREATE TABLE tributaries (
    tributary_id    INTEGER AUTO_INCREMENT,
    feed_id         INTEGER,
    uri             VARCHAR(255),
    reference_guid  VARCHAR(255),
    created         TIMESTAMP,
    PRIMARY KEY (tributary_id),
    FOREIGN KEY (feed_id) REFERENCES feeds(feed_id)
)
\g

The created stamp will be used as the offset for updates. Anything created in the last 24 hours is polled every fifteen minutes. Anything created in the last seven days is polled hourly. Easy SQL. After that, I’ll have to pick some way of chunking through the feeds daily or every other day.
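
The polling tiers can be sketched as a small function. This is only an illustration of the rule described above, written in Python rather than SQL; `poll_interval` is a hypothetical name, and returning `None` for older tributaries is an assumption standing in for the undecided daily chunking.

```python
from datetime import datetime, timedelta
from typing import Optional


def poll_interval(created: datetime, now: datetime) -> Optional[timedelta]:
    """How often to poll a tributary, based on its created timestamp."""
    age = now - created
    if age <= timedelta(hours=24):
        return timedelta(minutes=15)   # fresh: every fifteen minutes
    if age <= timedelta(days=7):
        return timedelta(hours=1)      # this week: hourly
    return None                        # older: left to the slower batch, not designed yet


now = datetime(2005, 12, 20, 12, 0)
print(poll_interval(now - timedelta(hours=2), now))
print(poll_interval(now - timedelta(days=3), now))
print(poll_interval(now - timedelta(days=30), now))
```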

Convert from RSS 2.0

I’ve already got a nice RSS 2.0 to Atom 1.0 converter. Let me roll out the WAR for you.

http://thinknola.com/beta/convert/from/rss-2-0.xml?uri=http%3A%2F%2Fengrm.com%2Fblogometer%2Ffeed%2F - Convert RSS 2.0 to Atom 1.0.

You can choose a feed to convert using this simple form. At this point, it should only really be working with WordPress RSS 2.0 feeds.
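
For anyone building the converter address by hand: the source feed goes in the `uri` query parameter, fully percent-encoded. A minimal sketch using Python's standard library, where `conversion_uri` is a hypothetical helper name and the base address is the beta URL shown above:

```python
from urllib.parse import quote

CONVERTER = "http://thinknola.com/beta/convert/from/rss-2-0.xml"


def conversion_uri(feed_url: str) -> str:
    """Address of the RSS 2.0 to Atom 1.0 conversion for one source feed."""
    # safe="" makes quote() escape ':' and '/' too, as in the example URL.
    return f"{CONVERTER}?uri={quote(feed_url, safe='')}"


print(conversion_uri("http://engrm.com/blogometer/feed/"))
```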

Import Comments as Tributaries

This turned out to be simple. I created a Relay pipeline that generates an Enfilade document, and put it in a tee, which is a runner that branches, and the table is now being populated. Now I need to get back to this, and create a pipeline that will import comments, and add the new Atom elements for threaded discussions.

WATCH THIS SPACE

Import Entries

Now that entries are Atom 1.0, I can import them. This bit can be reused for other types of feed as well. Once you’ve built the Atom 1.0 feed, this is the import procedure.

There are other imports where building an entry is an expensive operation. It’s not cheap with RSS 2.0, because I’m parsing the description and content using JTidy. It would be nice to not convert the entries that already exist in the database, but it’s not so darn expensive as all that.

It’s easiest to import the article body into the filesystem first, before putting the documents in the MySQL database. If the entry is written to file before it is written to the database, and the database write fails, the database will appear to not contain that file, so it will simply try to import it again during the next import. If the database record is written before the file is saved, and the file save fails, all kinds of bad things happen. This means I first need to check whether the article already exists, so that I don’t write it to file only to find it’s already in the database.

The procedure is as follows.

  1. Check which entries already exist in the database, strip those that exist from the feed.
  2. Write the entries to the filesystem.
  3. Create records for each entry in the database.
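
The three steps above can be sketched like this. It is an illustration only, not the Relay pipeline: it uses Python with an in-memory SQLite database standing in for MySQL, and the `entries` table, the hashed file names, and `import_entries` are all hypothetical.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def import_entries(entries: dict, db: sqlite3.Connection, root: Path) -> list:
    """entries maps guid -> Atom entry XML. Returns the guids that were imported."""
    # 1. Strip the entries the database already knows about.
    known = {row[0] for row in db.execute("SELECT guid FROM entries")}
    fresh = {guid: xml for guid, xml in entries.items() if guid not in known}

    # 2. Write the files first. If this fails, no record exists yet,
    #    so the next import simply tries the entry again.
    for guid, xml in fresh.items():
        name = hashlib.sha1(guid.encode("utf-8")).hexdigest() + ".xml"
        (root / name).write_text(xml, encoding="utf-8")

    # 3. Only then create the database records, in one transaction.
    with db:
        db.executemany("INSERT INTO entries (guid) VALUES (?)",
                       [(guid,) for guid in fresh])
    return list(fresh)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (guid TEXT PRIMARY KEY)")
with tempfile.TemporaryDirectory() as folder:
    print(import_entries({"a": "<entry/>", "b": "<entry/>"}, db, Path(folder)))
    print(import_entries({"b": "<entry/>", "c": "<entry/>"}, db, Path(folder)))
```

Running the import twice over overlapping feeds should only pick up the new entry the second time, which is the whole point of step 1.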

Checking if the entry exists requires two pipelines. One takes the feed and queries the database to create an XML document that looks like this.

<exists>
  <guid>http://engrm.com/blogometer/2005/12/19/the-corporate-blogging-book/</guid>
  <guid>http://engrm.com/blogometer/2005/12/18/keeping-the-ball-rolling-on-big-blogging/</guid>
  <guid>http://engrm.com/blogometer/2005/12/17/have-a-rupee-leave-a-rupee/</guid>
</exists>

This document will contain a <guid/> element for each entry in the current version of the feed that exists in the database. Then I’ll use an identity XSLT transform on the feed that references this document, and excludes those entries that exist in the document to create a feed document. Get it? It’s an identity transform. It creates a copy but skips those entries that can be found in the existence document above.
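
The same copy-but-skip idea, sketched outside XSLT for anyone who wants to see the logic on its own. This uses Python's ElementTree instead of an identity transform, and it assumes each Atom entry's `<id>` carries the same value as the `<guid>` in the existence document; `strip_existing` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # keep the default Atom namespace on output


def strip_existing(feed_xml: str, exists_xml: str) -> str:
    """Copy the feed, leaving out entries listed in the existence document."""
    known = {g.text.strip() for g in ET.fromstring(exists_xml).iter("guid") if g.text}
    feed = ET.fromstring(feed_xml)
    # findall() returns a list, so removing entries while looping is safe.
    for entry in feed.findall(f"{{{ATOM}}}entry"):
        entry_id = entry.findtext(f"{{{ATOM}}}id", default="").strip()
        if entry_id in known:
            feed.remove(entry)
    return ET.tostring(feed, encoding="unicode")


feed = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><id>http://example.org/a</id></entry>"
    "<entry><id>http://example.org/b</id></entry>"
    "</feed>"
)
exists = "<exists><guid>http://example.org/a</guid></exists>"
print(strip_existing(feed, exists))
```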

Then I run the document through my existing import script, which splits the document out into the filesystem, and writes to the database.

Better Living Through cron

Now it’s time to create a client program that will run this import periodically. I’m going to schedule a script in my crontab to poll the feeds. The script will simply make an HTTP GET request for each feed, invoking the Relay pipelines, and saving the output to file.
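
A sketch of what such a polling client might look like, in Python rather than the XSLT I actually plan to use. The pipeline address is assumed to be the beta converter URL from earlier, the output file naming is made up, and `poll_feeds` and `http_get` are hypothetical names. Failures are collected rather than raised, so one dead feed does not stop the rest of the run.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint: the beta converter shown earlier in the post.
CONVERTER = "http://thinknola.com/beta/convert/from/rss-2-0.xml"


def http_get(uri: str) -> bytes:
    """Plain HTTP GET that returns the response body."""
    with urlopen(uri, timeout=30) as response:
        return response.read()


def poll_feeds(feeds: dict, out_dir: Path, fetch=http_get) -> dict:
    """feeds maps a path like 'engrm.com/blogometer' to its source feed URL.

    Saves each response under out_dir and returns path -> error message
    for every feed that could not be fetched.
    """
    failures = {}
    for path, url in feeds.items():
        uri = f"{CONVERTER}?uri={quote(url, safe='')}"
        try:
            body = fetch(uri)
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            failures[path] = str(error)
            continue
        target = out_dir / (path.replace("/", "_") + ".atom.xml")
        target.write_bytes(body)
    return failures
```

The `fetch` parameter exists so the loop can be exercised without touching the network; a crontab entry would just call `poll_feeds` with the default.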

The scripting language will be XSLT of course. But, I need to get a question answered here. What happens to an XSLT transform when the result of a query is an error of some sort? Error handling in Relay is quite spotty at the moment. Definitely something that needs more attention.
