Andreas Weigend
Stanford University
Stat 252 and MS&E 238

Data Mining and Electronic Business

Homework 1

Assignment 1 is posted as a PDF. Some PHP code for a crawler is provided.
The homework is due via email on Sunday, April 15, 2007 by 5 PM.

Some feedback on homework 1
In general, our fellow students did a very good job on this homework assignment. For part 1, on Google Analytics, most people provided a good, clear discussion. For part 2, modifying the script, many students changed the crawler to retrieve information on other craigslist items, for example bikes. There are also some novel and interesting applications that are worth mentioning:

1) Ashis Roy wrote a script that searches local second-hand car ads for ones satisfying a budget constraint.

2) Trent Peterson used his script to compare average real estate prices across different areas. This could potentially provide really useful information. He also did a good job optimizing the PHP script. Mark Brenneman did a similar analysis.

3) Yaron Grief wrote a script to analyze the relationship between a bookmark's ranking and its number of comments on a social bookmarking site. The results are intuitive: the higher a bookmark ranks, the more comments it has.

4) Ron Gonzalez used a Java program to get movie genre information from IMDb, as part of an attempt at the Netflix challenge. The basic idea is to analyze the genres of a Netflix user's favourite movies and use this information to make better recommendations.

5) Jacob Bien applied the crawler to analyze the density of relevant links on different Wikipedia topics. He found significant differences between two of the topics he studied, Juggling and John L. Hennessy.

6) Jerome Ku has an interesting proposition: crawl the U.S. News & World Report online college directory to model the relationship between college ranking, cost of tuition, and number of applicants.

7) Alison Love ran a very interesting test on the Google engine. She used the crawler to study the relationship between the number of occurrences of a given keyword in a page and the page's PageRank. She found that, contrary to popular speculation, there is no significant relationship between them.
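
As a rough illustration of the kind of analysis described in item 7, the sketch below counts keyword occurrences in page text and computes a Pearson correlation against a per-page score. It is not the script any student submitted. The function names `count_keyword` and `pearson` are hypothetical, the sample pages and scores are made up purely to exercise the code, and real PageRank values would have to come from a separate source.

```python
import math
import re


def count_keyword(text, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy input: (page text, score) pairs invented for illustration,
# not real crawl results or real PageRank values.
pages = [
    ("data mining and more data", 3),
    ("no keyword here", 5),
    ("data data data", 4),
]
counts = [count_keyword(text, "data") for text, _ in pages]
ranks = [rank for _, rank in pages]
r = pearson(counts, ranks)
print(f"keyword counts: {counts}, correlation with score: {r:.3f}")
```

With real data, a coefficient near zero over a reasonably large sample of pages would be consistent with the "no significant relationship" finding, though a proper significance test would be needed to support that claim.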

Links for HW1: please click them to generate traffic for our fellow students: (graduate project my wife is working on).