Sunday, February 27, 2011

How I created my weekly feed digest

So, if you read the frizzlefry blog, then you probably noticed that today's issue was all about my weekly shared items in Google Reader.  I have been thinking about reorganizing all of my shared feeds into a series of digests.  The nice part is that other people did all the work for me, so all I need to do is put their work together into a script I can run.

So, where to start?  OK, first I want to parse an Atom feed, so I chose the Universal Feed Parser, which is a nice little utility for parsing RSS/Atom feeds, among other stuff I'm not currently interested in.  Because the library I wanted to use is in Python, I decided I was going to write a little Python script to get the work done... I guess the language of a library is as good a reason as any to choose your language.

Anyway, OK, so how does this thing work?  Really simple, as it turns out...
To start off, I import the feedparser library, and create a feed object:

import feedparser

d = feedparser.parse('http://www.google.com/reader/public/atom/user%2F11424318483849164397%2Fstate%2Fcom.google%2Fbroadcast')


So, that creates a variable d, which is a feed object.  All I want from the feed is the title and description, so I can access those simply by using:


import feedparser

d = feedparser.parse('http://www.google.com/reader/public/atom/user%2F11424318483849164397%2Fstate%2Fcom.google%2Fbroadcast')

d.entries[itemnumber].title
d.entries[itemnumber].description


However, I want to print out those values, which can be done with print() like this:


import feedparser

d = feedparser.parse('http://www.google.com/reader/public/atom/user%2F11424318483849164397%2Fstate%2Fcom.google%2Fbroadcast')

print(d.entries[itemnumber].title)
print(d.entries[itemnumber].description)


Next, I want itemnumber to mean something, so I'll put the whole thing together in a for loop that prints all entries like this:


import feedparser

d = feedparser.parse('http://www.google.com/reader/public/atom/user%2F11424318483849164397%2Fstate%2Fcom.google%2Fbroadcast')

for itemnumber in range(len(d.entries)):
    print(d.entries[itemnumber].title)
    print(d.entries[itemnumber].description)


Now, I wanted to limit the list to only include items after a given start date that I could specify as a command-line argument, so I added a couple more lines:


import sys
from datetime import datetime

import feedparser

startdate = datetime.strptime(sys.argv[1], '%Y-%m-%d').date()
d = feedparser.parse('http://www.google.com/reader/public/atom/user%2F11424318483849164397%2Fstate%2Fcom.google%2Fbroadcast')

for itemnumber in range(len(d.entries)):
    print(d.entries[itemnumber].title)
    print(d.entries[itemnumber].description)
    # Stop once the next entry predates the start date (guarding the
    # index so the last entry doesn't raise an IndexError).
    if itemnumber + 1 < len(d.entries) and \
            datetime.strptime(d.entries[itemnumber + 1].published,
                              '%Y-%m-%dT%H:%M:%SZ').date() < startdate:
        break


So, I created the startdate variable, which is a date provided on the command line in YYYY-MM-DD format; the loop then breaks if the date an entry was published is before (less than) startdate.
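Both format strings boil down to plain date objects, so the cutoff test is an ordinary comparison; here's a quick sketch with made-up dates (the values are hypothetical, the formats are the ones the script uses):

```python
from datetime import datetime

# Hypothetical values in the same two formats the script parses.
startdate = datetime.strptime('2011-02-26', '%Y-%m-%d').date()
published = datetime.strptime('2011-02-20T18:30:00Z', '%Y-%m-%dT%H:%M:%SZ').date()

print(published < startdate)  # → True, so the loop would break here
```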

Finally, I wanted to add some basic HTML formatting to make it easy for me to copy and paste the text into Blogger... that led me to my final version of the Python script, which looks a little like this:


import sys
from datetime import datetime

import feedparser

print(sys.argv[1])
startdate = datetime.strptime(sys.argv[1], '%Y-%m-%d').date()

d = feedparser.parse('http://www.google.com/reader/public/atom/user%2F11424318483849164397%2Fstate%2Fcom.google%2Fbroadcast')

print('<html><head><title>')
print(d.feed.title)
print('</title></head><body>')

for itemnumber in range(len(d.entries)):
    print('<h1>')
    print(d.entries[itemnumber].title)
    print('</h1>')
    print(d.entries[itemnumber].description)
    print('<br /><br />Date Shared: ')
    print(datetime.strptime(d.entries[itemnumber].published,
                            '%Y-%m-%dT%H:%M:%SZ').date())

    # Stop once the next entry predates the start date (guarding the
    # index so the final entry doesn't raise an IndexError).
    if itemnumber + 1 < len(d.entries) and \
            datetime.strptime(d.entries[itemnumber + 1].published,
                              '%Y-%m-%dT%H:%M:%SZ').date() < startdate:
        break
    print('<hr />')
    print('<hr />')
    print('<hr />')

print('</body></html>')


So, the first part of my script was done... I named the file GeneratePage.py and proceeded to create three more files.  The second file was SaveDate.py, which simply outputs today's date in the format I need, using two lines:

from datetime import date

print(date.today())


I figured the best way to track the previous time the script was run would be to save the date to a file, so I created a file called date.txt that contained only the string "2011-02-26".
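The state handling is nothing more than reading and rewriting that one-line file; here's a sketch of the round trip (it uses the same date.txt name, and the seed date is just my starting value):

```python
from datetime import date

# Seed the state file the same way I did by hand.
with open('date.txt', 'w') as f:
    f.write('2011-02-26\n')

# Each week: read the last run date to pass along to GeneratePage.py...
with open('date.txt') as f:
    last_run = f.read().strip()
print(last_run)  # → 2011-02-26

# ...then record today's date for next week's run.
with open('date.txt', 'w') as f:
    f.write('%s\n' % date.today())
```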

Then I created a shell script to tie everything together... it looks like this:

#!/bin/bash
D=$(cat date.txt)
python GeneratePage.py "$D"
python SaveDate.py > date.txt


This script reads the date from date.txt into a variable called D, runs GeneratePage.py with the value of D as its argument, and then runs SaveDate.py, with its output replacing the contents of date.txt.

At this point, the script can be run every week, and the output copied and pasted into Blogger.

The next step will be to automate the process of putting the digest onto the site, which will actually take a little effort, but I'm pretty sure someone else has already done the work for me; I just have to figure out who.

Oh, one other note, in case you are curious why I didn't just redirect all the output to a file: some of the feed items use Unicode characters that cannot be encoded as ASCII, so the script throws an encoding error when its output is redirected.  I might be able to fix this easily; however, I really don't mind having to copy and paste the characters into the blog anyway.
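For the record, the easy fix would probably be to skip the shell redirect entirely and have the script write the digest straight to a file opened as UTF-8.  A minimal sketch, with a made-up filename and a stand-in string for the real feed content:

```python
# Opening the output file as UTF-8 sidesteps the ASCII encoding errors,
# since every character the feed can contain is representable in UTF-8.
digest = '<h1>caf\u00e9</h1>\n'  # stand-in for real feed content

with open('digest.html', 'w', encoding='utf-8') as out:
    out.write(digest)
```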

Later,

     SteveO