Andreas Weigend
Stanford University
Stat 252 and MS&E 238

Data Mining and Electronic Business


Homework 3: Recommender System for del.icio.us

In this assignment you will create a simple recommender system for del.icio.us. Here is the use case: If you have just bookmarked www.weigend.com, the program will recommend URLs for you based on your own tagging history and that of others who, at some stage in the past, have also tagged www.weigend.com.
To design a recommendation system, let's start simple.
Find the set of all users who have bookmarked www.weigend.com or weigend.com. (A later version of del.icio.us will map these two together; in the current version you need to merge them manually.)
Looking at all of the URLs tagged by this set of users, how many URLs are tagged by only a single user, how many are tagged by 2 users, 3 users etc.?
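The counting step above can be sketched as follows. This is only an illustration on made-up data: the dictionary bookmarks_by_user and the function name users_per_url_histogram are hypothetical, and you would build the dictionary yourself from the posts returned by pydelicious.

```python
from collections import Counter


def users_per_url_histogram(bookmarks_by_user):
    """Return a Counter mapping k -> number of URLs bookmarked by exactly k users."""
    users_per_url = Counter()
    for urls in bookmarks_by_user.values():
        # set() makes sure a user counts at most once per URL
        users_per_url.update(set(urls))
    return Counter(users_per_url.values())


# Hypothetical data: user name -> list of bookmarked URLs
bookmarks_by_user = {
    "alice": ["http://a.example/", "http://b.example/"],
    "bob": ["http://a.example/", "http://c.example/"],
    "carol": ["http://a.example/", "http://b.example/"],
}

histogram = users_per_url_histogram(bookmarks_by_user)
for k in sorted(histogram):
    print(k, "user(s):", histogram[k], "URL(s)")
```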
When there are many such multiply bookmarked URLs, you might consider ranking the results with a "weighted" version that gives different users different weights before aggregation. Two examples of such weights are:
  • heavier weights for users whose tagging history is more similar to yours;
  • lower weights for users who have not been using del.icio.us for some time.
Another time dimension to be considered in weighting is: How do you want to treat time
  • relative to when the user tagged weigend.com (i.e., do you want to weight URLs that came afterwards higher than those that came before?), and
  • relative to the current time (should a URL tagged yesterday be weighted more than a URL tagged a year ago?)
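One possible way to combine a user weight with a time weight is sketched below. The choices here are assumptions made for illustration, not part of the assignment: Jaccard overlap of bookmark sets as the similarity measure, an exponential decay with a 180-day half-life for recency, and all function names and data. It only covers time relative to the current time; weighting relative to when the user tagged weigend.com would need an extra factor.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def jaccard(a, b):
    """Overlap of two URL collections: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recency_weight(tagged_at, now, half_life_days=180.0):
    """Weight that halves every half_life_days (assumed decay, tune as you like)."""
    age_days = max((now - tagged_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def weighted_scores(my_urls, others, now, half_life_days=180.0):
    """others maps user -> list of (url, datetime). Returns (url, score), best first."""
    scores = defaultdict(float)
    for user, posts in others.items():
        similarity = jaccard(my_urls, [url for url, _ in posts])
        for url, tagged_at in posts:
            if url in my_urls:
                continue  # do not recommend what I already bookmarked
            scores[url] += similarity * recency_weight(tagged_at, now, half_life_days)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data
now = datetime(2007, 5, 1)
my_urls = {"http://a.example/", "http://b.example/"}
others = {
    "bob": [("http://a.example/", now - timedelta(days=10)),
            ("http://c.example/", now)],
    "carol": [("http://a.example/", now - timedelta(days=5)),
              ("http://b.example/", now - timedelta(days=5)),
              ("http://d.example/", now - timedelta(days=30))],
}
ranked = weighted_scores(my_urls, others, now)
for url, score in ranked:
    print(round(score, 3), url)
```

Changing half_life_days or swapping the similarity function is one way to examine how the ranking reacts to your parameters.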
By design, an algorithm that removes noise by focusing on URLs that occur more than once (favoring more popular URLs over less popular URLs) is biased towards the common. When is this property desired? What are its potentially negative side effects?
To create an algorithm for bolder recommendations, biased towards uniqueness, how would you find the most "interesting" users who have tagged weigend.com? And how would you derive recommendations from their most "interesting" bookmarks?
Compare these two approaches and explain the differences. When would you use the first algorithm, biased towards more frequently tagged URLs, and when would you use the second, biased towards individuality?
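As a starting point for the second approach, here is a minimal sketch of one possible definition of "interesting". It is an assumption for illustration only: a URL's rarity is taken to be log(N / n), where N is the number of users and n the number of users who bookmarked that URL, and a user's interestingness is the mean rarity of his or her bookmarks. The function names and data are hypothetical, and other definitions may serve you better.

```python
import math
from collections import Counter


def url_rarity(bookmarks_by_user):
    """Rarity of each URL: log(N / n), which is 0.0 for a URL everyone bookmarked."""
    n_users = len(bookmarks_by_user)
    users_per_url = Counter()
    for urls in bookmarks_by_user.values():
        users_per_url.update(set(urls))
    return {url: math.log(n_users / count) for url, count in users_per_url.items()}


def user_interestingness(bookmarks_by_user, rarity):
    """Mean rarity of each user's distinct bookmarks (users without bookmarks are skipped)."""
    result = {}
    for user, urls in bookmarks_by_user.items():
        distinct = set(urls)
        if distinct:
            result[user] = sum(rarity[url] for url in distinct) / len(distinct)
    return result


# Hypothetical data: user name -> list of bookmarked URLs
bookmarks_by_user = {
    "alice": ["http://a.example/", "http://b.example/"],
    "bob": ["http://a.example/", "http://c.example/"],
    "carol": ["http://a.example/", "http://b.example/"],
}

rarity = url_rarity(bookmarks_by_user)
interest = user_interestingness(bookmarks_by_user, rarity)
most_interesting = max(interest, key=interest.get)

# Recommend that user's bookmarks, rarest first
picks = sorted(set(bookmarks_by_user[most_interesting]),
               key=lambda url: rarity[url], reverse=True)
print(most_interesting, picks)
```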
So far, you have only considered the fact that others have tagged the URL, and ignored the specific tags as well as the comments. How would you use the number of tags (maybe relative to how many the person usually uses, maybe relative to an overall distribution), and how would you use the uniqueness of the tags? And how about the comments they leave?
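To make "relative to how many the person usually uses" concrete, one simple option is the ratio of a post's tag count to the user's average tag count. The sketch below assumes tags arrive as a space-separated string, as in the 'tags' field of a post; the function name and data are hypothetical.

```python
def relative_tag_count(post_tags, user_tag_history):
    """Ratio of this post's tag count to the user's mean tags per post.

    post_tags: space-separated tag string for one post.
    user_tag_history: list of such strings, one per post by the same user.
    Returns 0.0 if the user has no usable history.
    """
    counts = [len(tags.split()) for tags in user_tag_history]
    mean_count = sum(counts) / len(counts) if counts else 0.0
    if mean_count == 0.0:
        return 0.0
    return len(post_tags.split()) / mean_count


# Hypothetical user who normally uses two tags but used four on this post
effort = relative_tag_count("internet web2.0 data mining",
                            ["python code", "news blog"])
print(effort)
```

A value above 1 would suggest the user put more tagging effort than usual into this bookmark.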
Before you start working on this, first register an account on del.icio.us, and tag some URLs of interest to you (this will be your history). Please put your Python code and 25 ranked recommendations for the target URL on your webpage. Carefully examine and discuss the change in results as you change your parameters. Post the description of the choices you made, the effects you investigated, and a set of 25 recommendations on your page, and submit the link to this webpage to the usual email address by the usual deadline.

This exercise uses Python and the pydelicious (del.icio.us Python API) package. Python and its tutorials are available on www.python.org. On Windows, after you have run the Python installer, you may have to manually add the installation directory to your system path environment variable (Right click on My Computer -> Properties -> Advanced -> Environment Variables). Instructions for getting pydelicious can be found at http://code.google.com/p/pydelicious/source . Please note that in order to check out pydelicious, you will first need a source code version control system called SVN (Windows binary: http://subversion.tigris.org/files/documents/15/36797/svn-1.4.3-setup.exe; Mac OS: http://metissian.com/projects/macosx/subversion/ ; Linux: you should know). After you have checked out the pydelicious source code, you need to run, inside the pydelicious directory,
python setup.py build
python setup.py install

to copy the files to the right directories in your Python installation. Only after this will you be able to import the package in Python.

Note: Instead of installing Python and SVN on your own personal Linux/Windows/Mac machine, you can use Stanford's cluster, which already has Python and SVN installed.
First, ssh into one of Stanford's many computer servers and type this:
svn checkout http://pydelicious.googlecode.com/svn/trunk/ pydelicious
cd pydelicious
python setup.py build

and now you are ready to run python
>python
>>>import pydelicious
Because access restrictions prevent us from copying the pydelicious code into the Python directory, you must be inside the pydelicious directory created by svn when you start Python (use the pwd command in Linux to check the path). --jack


Here are some code snippets to illustrate how to use pydelicious to get information from del.icio.us:
1) get the list of posts for a given URL (each post corresponds to one user's bookmark entry for this URL). A post is a data object containing the URL, user, timestamp, etc. The function pydelicious.get_urlposts returns a list of dictionaries, each corresponding to a post.
import pydelicious
up = pydelicious.get_urlposts("http://www.weigend.com/")
To see what a post looks like, print up[1] (note that up[0] is the first post):
{ 'count': '',
'description': u'[from zehao_chen] \u261e Andreas S. WEIGEND, PhD',
'dt': u'2007-04-23T21:00:09Z',
'extended': u'',
'hash': '',
'href': u'http://del.icio.us/url/e78668880a3ac1f62edbd3449e095df3#zehao_chen',
'tags': u'internet web2.0',
'user': u'zehao_chen'}
You can get all the information you need from this list of posts. Please note that you have to write the full URL with a leading "http://" and a trailing slash: www.weigend.com and http://www.weigend.com will not work; only http://www.weigend.com/ works.
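The following sketch shows one way to prepare the URL and to pull the useful fields out of a post. It does not call pydelicious; it works on a dictionary with the same keys as the post printed above. The helper names normalize_site_url, summarize_post, and users_from_posts are hypothetical, and normalize_site_url is only meant for site-root addresses such as www.weigend.com, not for URLs that already contain a path.

```python
from datetime import datetime


def normalize_site_url(url):
    """Add the leading "http://" and the trailing slash to a site-root address."""
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    if not url.endswith("/"):
        url += "/"
    return url


def summarize_post(post):
    """Extract user, timestamp, and tag list from one post dictionary."""
    return {
        "user": post["user"],
        # 'dt' looks like 2007-04-23T21:00:09Z
        "tagged_at": datetime.strptime(post["dt"], "%Y-%m-%dT%H:%M:%SZ"),
        # 'tags' is a single space-separated string
        "tags": post["tags"].split(),
    }


def users_from_posts(*post_lists):
    """Union of user names over several post lists (e.g. the www and non-www URL)."""
    return {post["user"] for posts in post_lists for post in posts}


# A post with the same fields as the example above
sample_post = {
    "dt": "2007-04-23T21:00:09Z",
    "tags": "internet web2.0",
    "user": "zehao_chen",
}

print(normalize_site_url("www.weigend.com"))
print(summarize_post(sample_post))
print(users_from_posts([sample_post], [sample_post]))
```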
2) get the list of recent posts for a given user
import pydelicious
up = pydelicious.get_userposts("zehao_chen")
You can only get recent posts up to a limited number.
3) get the list of recent posts for a given tag
import pydelicious
up = pydelicious.get_tagposts("programming")
For more information, please refer to doc.txt in your pydelicious directory.