Andreas Weigend
Stanford University
Stat 252 and MS&E 238

Data Mining and Electronic Business

Class 3 (04/23/07) - Attention Data: Discovery


  • When you search for something, you know it exists.
  • Discovery means you find something new, perhaps serendipitously, even if you are not looking for it.
  • Search can lead to discovery, but there must be an intention involved.
  • A key element of discovery is that it brings others' discoveries to you.


How do we make discoveries?

  • Recommendation by friends
  • From books, newspapers
  • Through feeds and emails
  • From popular lists (e.g. the top 200 movies on IMDb)
  • Recommender systems such as StumbleUpon

Email vs RSS

  • E-mail is a store and forward method of composing, sending, storing, and receiving messages over communication systems.
  • RSS is a family of web feed formats used to publish frequently updated digital content such as blogs, news feeds or podcasts.
  • Email can be thought of as a 'push' while RSS can be thought of as a 'pull'.
  • Email is well suited for communication between a few people while RSS is well suited for massive dissemination.
  • Email carries an expectation that it will be read by the recipient, while RSS carries no such expectation.
  • RSS has a strong discovery component broken into two parts - discovery of feeds and discovery of individual articles.
  • RSS uses social mechanisms to facilitate feeds

Exploitation vs Exploration in data mining

  • There is a trade-off between exploitation and exploration in data mining
  • Exploitation - Something that is known to work is taken advantage of (e.g. in the search arena)
  • Exploration - Try something new that may or may not be productive (if it is productive, it may be exploited later)
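
A minimal sketch of the trade-off above, assuming an epsilon-greedy rule (a common way to balance the two, not something prescribed in class). The item names, the estimated click rates, and the function name choose_action are hypothetical.

```python
import random

def choose_action(estimated_value, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    actions = list(estimated_value)
    if rng.random() < epsilon:
        return rng.choice(actions)                 # exploration: try anything
    return max(actions, key=estimated_value.get)   # exploitation: use what works

# Hypothetical estimated click rates for three items
estimates = {"known_best_seller": 0.12, "new_item_a": 0.05, "new_item_b": 0.0}

rng = random.Random(0)
picks = [choose_action(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
share_exploit = picks.count("known_best_seller") / len(picks)
print(f"share of picks going to the known best item: {share_exploit:.2f}")
```

With a small epsilon most picks go to the item already known to work, while the remainder gives new items a chance to prove themselves.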


* Tagging allows the sharing of bookmarks based on user-defined labels

  • Progression to tagging
    • Tag - Label Items. To remember, share, discover.
    • Search - Express intention.
    • Click - Only able to click on links given.
  • Amount of specificity increases upward (each step gives more attention to something)
  • Tags are distilled attention, a pure form of attention.
  • "You are what you tag."
  • "You are who you are tagged by, what you are tagged as."
  • The driver for tagging is that people want to remember something or express attention.
  • If a person is aware of how others tagged an item, it could minimize that person's creativity when tagging that and other similar items.
  • Tagging allows a person to share their organization of the world, e.g. sharing bookmarks
  • Tagging allows a person to order items the way they'd like and is not dictated by page rank or similar algorithms
  • Tags add another layer of connection on the web
  • Examples of tagging sites

  • A collection of favorites to
    (collective intelligence)
    • Keep links of your favorites
    • Share favorites
    • Discover new things (someone's favorites)
  • A social bookmarking website
  • Tag links and share them with the rest of the world
  • and other bookmarking sites allow a person to put a different topology on top of the web.
  • If your ranking function depends on your past, you can re-weight your keywords using your tags
  • The terms that people use to tag a link can be used to make many inferences about the link
  • You can follow the attention stream of others by subscribing to the trail of bookmarks that they leave.
  • now allows users to tag links as private
  • Related tags - shows a list of tags that are related to the topic that we are looking for.
  • Recognition is given to the first person to tag (and perhaps discover) a site by displaying his/her name
  • A tag's meaning can be determined by its use (Given a tag, what do others use it for?)
  • Example
    • Tag
    • Discovery:
      • Who are the other people who have tagged JetBlue?
      • What are the tags that they have used?
      • What are the other links that people who have tagged JetBlue, also tagged?
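
The three discovery questions above can be sketched over a small in-memory list of bookmarks. The users, URLs, and tags below are made up for illustration, and the function names are hypothetical; this is not the API of any real bookmarking service.

```python
from collections import Counter

# Hypothetical bookmarks: (user, url, tags)
bookmarks = [
    ("alice", "jetblue.com", {"jetblue", "travel", "airline"}),
    ("bob", "jetblue.com", {"jetblue", "flights"}),
    ("alice", "kayak.com", {"travel", "search"}),
    ("bob", "seatguru.com", {"flights", "travel"}),
    ("carol", "imdb.com", {"movies"}),
]

def users_of_tag(tag):
    """Who are the people who have used this tag?"""
    return {user for user, _, tags in bookmarks if tag in tags}

def tags_of_users(users, exclude):
    """Which other tags have those people used, and how often?"""
    counts = Counter()
    for user, _, tags in bookmarks:
        if user in users:
            counts.update(tags - {exclude})
    return counts

def other_links(users, tag):
    """Which links did those people bookmark without using the tag?"""
    return Counter(url for user, url, tags in bookmarks
                   if user in users and tag not in tags)

people = users_of_tag("jetblue")
print(sorted(people))
print(tags_of_users(people, exclude="jetblue").most_common())
print(sorted(other_links(people, "jetblue")))
```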

Interesting observations on

  • and won't be identified as the same thing and merged
  • Joshua, the founder, doesn't want any spell correction of tags
  • private bookmarks are not allowed after discussion with VC

Discovery on

  • Discover a topic by following a tag.
  • Discover what a user is interested in by following the user.
  • Discover other users by looking at a page and seeing who has tagged that page.
  • Discover new sites with related information.

Note on Self-Selection on the Web

  • The web contains a huge number of communities that people can choose to be a part of.
  • Self-selection allows people to avoid things they don't want to see.


  • The measure of interestingness of any site/article is based on the attention data of people. "People remember what is worth remembering."
  • One purpose of collecting attention data (web, RSS reading, mobile, e-mail, etc.) is to help people discover new things via social processes (e.g. from the discoveries of others)
  • How to measure attention
    • Attention economy
  • Examples
    • Hitwise (next week's class):
      • Has millions of users. Hitwise is an internet monitoring service that collects data directly from ISP networks and groups the aggregate usage information into commercial verticals (travel, finance, retail, etc.).
    • Comscore: similar to Alexa, provides website traffic data from users who have installed its proprietary monitoring software
    • Quantcast: provides website traffic data as well as the demographics of a website's visitors (e.g. age, gender)
    • Alexa: collects web traffic information from users who have downloaded the Alexa toolbar, and derives website traffic estimates from that data together with statistical sampling adjustments
    • Compete: similar to Alexa, but it also estimates unique visitors and distinguishes US visitors from international visitors
    • Consumer Confidence
      • Well measured indices (prediction in the stock market)
      • Robust non-representative index (unbiased)
      • Important to get the evaluation criteria out upfront

Recommender Systems

How to make a product recommendation

  • Information sources
    • Product Space
      • Price
        • How to represent the price of a product
          • Cleverset
            • Normalized logarithm of price
            • Substitutables - sort and look at the decile/percentile of where this ranks with respect to other substitutables in the list
      • Consumer feedback
        • Popularity
      • Attributes
        • Staying on the surface of attributes is easy; deep diving is difficult
          • Example of IBM Research Team
        • Stop words are removed
        • Information Gain (of all words)
        • Singular Value Decomposition (look at the drop in eigenvalues)
        • Independent Component Analysis
          • Technique for extracting, from a high-dimensional space, a low-dimensional hyperplane that contains most of the information and less noise
      • Text
        • The text in the database that describes the product. (Absolute attribute specifications can also be used, but they make building the recommendation system a complex task.) First, apply methods such as stemming, truncation, and stop-word removal to produce a feature set that is not diluted. Then use information gain to weight the extracted words.
      • Brand
        • Match high-level brands with people who are associated with high-level brands
  • Product recommendations vs ads (differences: number of ads vs products, evaluation as buy vs click)
      • Method of payment: Evaluation as Click, exposure, or brand
      • Number of ads vs products
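
The two price representations listed above under Product Space can be sketched as follows. The notes do not say how the logarithm is normalized, so min-max scaling to [0, 1] is an assumption here, as are the function names and the example prices.

```python
import math

def normalized_log_prices(prices):
    """Log-transform prices, then min-max scale to [0, 1] (assumed normalization)."""
    logs = [math.log(p) for p in prices]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [0.0] * len(prices)
    return [(x - lo) / (hi - lo) for x in logs]

def percentile_rank(price, substitute_prices):
    """Fraction of substitutable products priced at or below `price`."""
    at_or_below = sum(1 for p in substitute_prices if p <= price)
    return at_or_below / len(substitute_prices)

# Hypothetical prices of substitutable DVD players
dvd_players = [40.0, 60.0, 80.0, 120.0, 400.0]

features = normalized_log_prices(dvd_players)
rank = percentile_rank(80.0, dvd_players)
print([round(f, 2) for f in features])
print(rank)
```

The logarithm compresses the gap between cheap and very expensive items, while the percentile only says where a product sits among its substitutes, regardless of absolute price.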

  • Reinforcement Learning
    • Associates with each action the expected return for that action. Over time, through experiments, the estimate of the expected return is learned by reinforcement.
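
A minimal sketch of the idea of keeping an expected return per action and reinforcing it with each experiment. The incremental update rule with a fixed step size is one standard choice and is an assumption here; the class name, action name, and rewards are hypothetical.

```python
class ActionValues:
    """Running estimate of the expected return of each action."""

    def __init__(self, step_size=0.1):
        self.step_size = step_size
        self.expected_return = {}

    def update(self, action, observed_return):
        # Move the estimate a fraction of the way toward the observed return.
        old = self.expected_return.get(action, 0.0)
        new = old + self.step_size * (observed_return - old)
        self.expected_return[action] = new
        return new

values = ActionValues(step_size=0.5)
for reward in [1.0, 1.0, 0.0]:  # e.g. click, click, no click
    values.update("show_recommendation_a", reward)
print(values.expected_return)
```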

Evaluation of recommender systems

  • The importance of evaluation ("closed loop feedback")
    • Figure out what to predict (facts) and navigate to them
    • Find which approach predicted the facts more accurately
    • Point predictions or probability distributions
  • Closed loop systems
    • Big advantage of online recommender systems
      • Allows the measurement of how small changes to an algorithm affect evaluation criteria
      • Examples
        • Click probability
        • Purchase probability
        • Revenue of one algorithm vs revenue of a second algorithm
    • Different ways of representation (both for products and ads, item filtering)
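
As an illustration of closed-loop evaluation, the sketch below compares the click probability of two algorithm variants. The two-proportion z statistic is a standard test chosen for this example, not something named in the notes, and the click and impression counts are made up.

```python
import math

def click_rate(clicks, impressions):
    """Observed click probability."""
    return clicks / impressions

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic for the difference in click rate between variants B and A."""
    p_a = click_rate(clicks_a, n_a)
    p_b = click_rate(clicks_b, n_b)
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err

# Hypothetical results: algorithm A vs a slightly changed algorithm B
z = two_proportion_z(clicks_a=200, n_a=10_000, clicks_b=260, n_b=10_000)
print(f"z = {z:.2f}")
```

A z value well above about 2 suggests the change in click rate is unlikely to be noise; the same comparison could be run on purchase probability or revenue per visitor.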

Contextual recommendation systems (Context of page vs Context of user)

  • Page context (analyze words on a page)
    • AdSense, etc
  • User context (state)
    • Eric Horvitz of Microsoft has done extensive research in this area using models
    • Examples of contextual data
      • Location
      • Phone vs Broadband
      • Time of day
      • Persistent history of the user
      • Behavioral targeting and use of active learning
        • Google is very good at this
        • nugg is another example
      • Note: Demographics are more constant and not a good example of useful contextual data
    • Examples of usage
      • A person who performs a search for a specific DVD player vs someone who performs a search for "DVD player"
      • A person who is referred by a price comparison site vs a person who directly clicks

Examples of recommender systems

  • MoodLogic - Music: find a space where similarly perceived songs are close by
  • StumbleUpon
    • The company's product is a browser plugin which provides a way for users who are browsing the internet to discover websites in which they will probably be interested. Suggestions are based upon the ratings they give to other websites, their own preference settings, the ratings their friends give, etc.
external image hp_toolbar_ss.gif


external image 2640270303001.png
external image recosystems_powerlawcurve.jpg

Guest Speaker: Joshua Schachter

(Founder, & Director, Yahoo Social Search)
  • The purpose of an online bookmarking system is to let people save things they like or may want to recall later.
Interesting facts about
  • 300k users in Dec 2005, rising to 1 million users by Sep 2006; the traffic-rank graph shows a clear increase ever since launch.
  • Passed 2 million users in Feb 2007
  • 180 million overall bookmarks (or 80 million unique URLs) as of Feb, 2007
  • Avg: 2 tags/bookmark for things that people type in by hand
  • Avg user adds about 30-40 bookmarks per week (median 50 bookmarks per week)
  • 90% of pages bookmarked only once
  • Most popular : Google, Slashdot, Amazon
  • Yahoo, Amazon, Wikipedia
  • Individual page : 30k
  • Amazon total : 500k
  • For pages with more than 10 bookmarks, about 50% of users converge on a common term for tagging; too much overlap suggests spamming
  • Anomalous behavior - spam and other abuse
  • Initially, every part of the site was on RSS so that people could pay attention - to get more people to know about
  • Dynamic behavior:
  • Eg. which memes travel together
  • Before:
    • About a third of users who signed up only bookmark something. Of those, a third of them still use the system a week later. After that, the drop is very low.
  • Now:
    • The activity is higher because it is easier to download and install (with Firefox extension plugin - 1 million downloads and installs)
  • Intended users:
    • People who swim in lots of information
    • Not: here is my financial institution, and done.
  • Info production vs info retrieval/access
    • Production
      • Signalling
        • Less work than a blog
  • Access
    • Own
    • Friends
    • Anyone
  • Digg vs
    • Digg is discovery oriented
    • focuses on memory
  • Tradeoff:
    • People produce more if it is private
    • Also, connotations of 'don't share' vs private
  • Externalities:
    • While you might like your financial transactions to be anonymous, the world is a better place if everybody does.
    • Discovery
      • Strong driver
      • Cost of saving something is low so that people save often.
  • Viral Features
    • Your bookmarks of the day for your blog
    • Importing & Exporting bookmarks
    • RSS on every page, Blog integration, JSON Badges, API's
  • API
    • Several million hits per day
  • Applications
    • What percentage of bookmarks today are hit - PHP vs Python
  • Social System
    • Wikipedia - nice end result but ugly wars (eg. Harp players)
      • High score list - strong negative effects on the system
      • 2nd order effects tend to te (2nd order effect = consequences of consequences)
    • Feedback
      • People ask for the same things again and again
      • Alphabetize
      • They want to but they don't use it.
      • Good example of wanting something being different from using it when it is there
      • Publisher wants to pre-fill the tags
  • Recommended tags on is not a recommendation system but an intersection of your tags and other people's tags
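
The intersection described above can be sketched in a few lines. The function name and the example tag sets are hypothetical.

```python
def recommended_tags(my_tags, tags_others_used_for_page):
    """Tags I already use that others have also applied to this page."""
    return sorted(set(my_tags) & set(tags_others_used_for_page))

mine = {"travel", "python", "airline"}
others = {"airline", "jetblue", "travel", "flights"}
print(recommended_tags(mine, others))
```

Nothing is predicted or ranked here: a tag is suggested only if it already appears in both sets, which is why it is not a recommender system in the sense used earlier in these notes.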

  • Future of social software
    • Current browser: tabbed version of Mosaic
    • More social browser: People & Pages
    • Need to move up in the abstraction layer
    • Facebook and , etc are also identity
    • Identity & Attention are not taken into account by the actual grain...

Related Links


Initial Contributors