Around this time last year, I wondered how Dave Smith obtained the course data for his Penn State scheduling website, LionSchedules.com. Without an official API serving the data, he would have had to scrape the Schedule of Courses himself.
I forgot about this until recently, when Alan deLevie (Sophomore, Political Science) told me about his own experiment with the Schedule of Courses. He was having trouble parsing the course section information, so I decided to give it a shot.
Having learned a bit of Python over the past month, I realized that scraping the Schedule of Courses is not a difficult task. I put in a few hours yesterday getting Beautiful Soup to successfully scrape all the data present on one of the course lists.
The Penn State Schedule of Courses runs on an ancient script that is horribly coded by current standards. Its output is a big jumble of table, font, and bold tags. The most challenging part of scraping the site was grouping each course's information with its individual sections, since the two live in adjacent but separate tables.
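Here is a rough sketch of that grouping step. The markup in SAMPLE_HTML is made up for illustration (the real page is far messier), and names like TableCollector and group_courses are hypothetical, not taken from my actual script. It also uses Python's built-in html.parser rather than Beautiful Soup, purely so the snippet runs without installing anything. It assumes tables are not nested and that course tables and section tables strictly alternate.

```python
from html.parser import HTMLParser

# Hypothetical markup: it only imitates the "course table followed by a
# section table" layout described above, not the real Schedule of Courses HTML.
SAMPLE_HTML = """
<table><tr><td><font><b>CMPSC 101</b></font></td><td>Intro to Programming</td></tr></table>
<table>
  <tr><td>001</td><td>MWF 9:05</td></tr>
  <tr><td>002</td><td>TR 1:00</td></tr>
</table>
<table><tr><td><font><b>MATH 140</b></font></td><td>Calculus I</td></tr></table>
<table>
  <tr><td>001</td><td>MWF 10:10</td></tr>
</table>
"""


class TableCollector(HTMLParser):
    """Collect each table as a list of rows, each row a list of cell strings.

    Assumes tables are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row currently being read, if any
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        # font, b and other presentational tags are ignored on purpose

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def group_courses(html):
    """Pair every course table with the section table that follows it."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    tables = parser.tables
    courses = []
    # Even-indexed tables hold a course, odd-indexed tables hold its sections.
    for course_table, section_table in zip(tables[0::2], tables[1::2]):
        code, title = course_table[0]
        sections = [{"section": num, "time": time} for num, time in section_table]
        courses.append({"code": code, "title": title, "sections": sections})
    return courses


if __name__ == "__main__":
    for course in group_courses(SAMPLE_HTML):
        print(course["code"], course["title"], course["sections"])
```

Pairing by position is the simplest thing that could work. If the real page ever drops a section table or slips in a layout table, the pairs would shift, so a sturdier version would probably need to check what each table contains before matching them up.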
Based on Alan’s recommendation, I’ll work on storing all of the course data in a database and making it available through a free API.
Until I get trackbacks working: http://www.alandelevie.com/2008/07/18/penn-states-schedule-of-courses-un-dapped-potential/