Andreas Weigend
Stanford University
Stat 252 and MS&E 238

Data Mining and Electronic Business


Bruce D'Ambrosio, CleverSet presentation slides.

Class 7


Email / Prep for students before class
    • Textbook for the Python experiment was recommended to Prof. Weigend by the CTO of O'Reilly (recommendation out of context)
    • Librarian recommending Google for web search.
    • Malcolm Gladwell’s Tipping Point (through multiple courses, design school application)
    • Travel websites, including ones read about in the paper

5-step framework (Iterative)


1. Problem

    • Objective

      • What are you trying to accomplish?
    • Metrics

      • Expected value, parameters for comparison, and a measure of how well we are doing, based on the application
    • Baselines

      • What does it mean to do well?

2. Data

    • Data Collection - How are we going to collect data?
    • The last decade has not seen much progress in algorithms, but a lot of progress in data collection.
    • E.g., D. E. Shaw in the early '90s had the cleanest database for high-frequency foreign exchange rates, which gave them an edge purely on the data-collection side.

3. Actions

    • What are we going to do with the data?
    • Turn our insights into business
    • Wall Street: simply put, buy or sell (small action space) but hard (large noise levels)
    • Amazon/eBay: millions of options (large action space) but not hard to sell items (small noise levels)

4. Model

    • Model class
      • Nonlinear global models such as neural networks, regression trees, Support Vector Machines, etc. - each model class makes certain assumptions about the world.
      • Performance depends on whether those assumptions actually correspond to the real world.
      • Do we treat time intrinsically in the model (as in a recurrent neural network or a Hidden Markov Model), factor it out, or ignore it?
    • Loss function (or Objective function)
      • What are we trying to minimize & how are we going to minimize it?
    • Optimization
      • How to optimize / which algorithm to use? (see the sketch after this list)
      • Gradient descent, second-order methods, Broyden-Fletcher-Goldfarb-Shanno (BFGS)
      • Random search to explore the landscape
    • Variable Selection
      • Efficient algorithms for searching feature space
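
A minimal sketch of the optimization step, assuming a squared-error loss on a simple linear model; the data and learning rate are made up for illustration, not from the lecture:

    import numpy as np

    # Hypothetical data: fit y_hat = w * x + b; the true relation is roughly y = 2x + 1.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 2.9, 5.1, 7.0])

    w, b = 0.0, 0.0
    learning_rate = 0.01

    for step in range(2000):
        y_hat = w * x + b
        # Loss function: mean squared error (what we are trying to minimize).
        loss = np.mean((y_hat - y) ** 2)
        # Gradient descent: move the parameters against the gradient of the loss.
        grad_w = np.mean(2 * (y_hat - y) * x)
        grad_b = np.mean(2 * (y_hat - y))
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

    print(w, b, loss)  # converges near w = 2, b = 1

Second-order methods and BFGS replace the fixed learning rate with curvature information, but the loop structure is the same.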

5. Analysis

    • Back Testing: running a model in real time to see what's coming
      • Evaluation
        • The loss function is different from the evaluation function (what does it mean to do well?)
      • Interpretation
        • Explanation: What counts as an explanation?

Outline of today’s class: Recommender Systems

    • Problem

      • What real-world problem is being solved by recommender systems?
      • Variations of the problem
        • Different Applications
          • Navigation / Advertising / News
        • Frameworks
          • Collaborative Filtering (many uses of this term)
          • Social Networks
          • Relational Models
    • Data and Data models

      • Three dimensions of data modeling: Visitor, Product Catalog (Items), Site
      • Visitor Data

        • Interactions : current/historic, implicit/explicit
        • Demographics
        • Persistent

          • Registration Info (e.g., Name, Gender, DOB)
          • Order History
        • Session-level

          • Referrer (did they click on a link to the site?)
          • Time (of day, day of week, season/event)
          • IP-based info (browser, connection speed, etc.)
        • Page-View

          • Time on page
          • Page actions
          • Page transition
          • Page characteristics
        • Below these three levels, we can collect info about what visitors are doing on the page itself, e.g., how the cursor is moving around within the page (see the sketch below)
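
The three levels above can be sketched as a nested data model; these field names are illustrative, not an actual CleverSet schema:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PageView:
        url: str
        seconds_on_page: float                              # time on page
        actions: List[str] = field(default_factory=list)    # page actions

    @dataclass
    class Session:
        referrer: str                                       # did they click on a link to the site?
        time_of_day: str                                    # time of day, day of week, season/event
        ip_info: str                                        # browser, connection speed, etc.
        page_views: List[PageView] = field(default_factory=list)

    @dataclass
    class Visitor:
        # Persistent level: registration info and order history.
        name: str
        gender: str
        orders: List[str] = field(default_factory=list)
        sessions: List[Session] = field(default_factory=list)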

|| || Attribute || Relation ||
|| Set || Behavior Model || ||
|| Instance || || Social Network ||

        • Note: This model identifies general properties, defines clusters, etc. (e.g., people who spend less than 3 seconds on this page are likely to do X)
        • Behavioral Model: a model of how visitors are going to interact. It is an attribute model (e.g., visitors coming from the southwest of the United States on high-speed internet, looking for expensive swimwear) in which we categorize visitors into groups (but without relational information across groups)
        • Social Network: relational information over instances
          • Recommendations based on the social network: using information between individuals to make a recommendation, e.g., Individual X is looking at Product Y, and you are a friend of Individual X (have a relation with visitor X), so Product Y will be shown to you
          • The question is: how can we extract the social relationships? Example: Amazon's Share The Love





    • Product Catalog Modeling
      • Retailers have categories in their catalogs
      • Product categories are often not very useful
      • Customers have a very different view of the catalog's contents and its categories than does the retailer
      • Customers' views can differ from visit to visit, and of course each customer's views also differ
        • views vary based on experience
        • views vary based on objectives
      • Challenge
        • Develop visitor-based catalog model
        • Instance / abstraction
          • Xxx has 3.2 Megapixels
          • batteries are accessories for cameras
        • Attribute / relational
          • Xxx has 3.2 Megapixels
          • People who considered Xxx also considered Y
    • Catalog II


|| || Attribute || Relation ||
|| Set || Product Catalog (Categories, Brand ranking) || ||
|| Instance || Product Catalog || Collaborative Filtering ||
      • Catalog information is primarily filled with attribute information (the upper left cell).
      • Collaborative Filtering is instance-based and relational: it exploits relationships among product pairs, and between buyers and what they buy
        • A ringtone company wanted to know, based on prior music-purchase info, what their buyers would buy next
        • Relation information and instance information, available from both the company and the web, were combined to provide an answer
      • Note: Of the 12 possible information sources (3 data dimensions x the 2x2 table above), Collaborative Filtering addresses only ONE SINGLE of these 12 cells.
      • While possibly a decent strategy for academic research, to solve the business problem it is unwise to ignore the remaining 11.
    • Site Modeling

      • The visitor interfaces with the catalog through the web browser, and then there is the question of where the customer is in the catalog.
        • Page type
          • Static predefined page categories
          • Page can look different depending on audience or session intention
        • Location
        • Content
          • Length
          • Type
          • Structured / unstructured
    • Site II

|| || Attribute || Relationship ||
|| Set || Page type || ||
|| Instance || Page entry / exit || Page Transition ||
      • Most data is site focused
        • Page transition, clickstream
    • Performance / Evaluation of Recommender Systems

    • How do we know we are doing well? (see the A/B sketch after this list)
      • Offline vs. Online
      • Offline
        • Offline is a great academic exercise, but in the real world Online has proven much more useful.
        • Offline means the data is analyzed in a box
        • Quality is gauged by how well you predict what people are going to buy next
        • Very difficult to do: because of the variety of products available, your probability of predicting the next purchase is very small
        • Not much progress has been made using this method
        • Most companies are not willing to allow researchers to use Online methods.
        • Paper by Hal Varian; the hopes of the paper are still not realized.
      • Online is far superior
        • Big advantage of online recommender systems: allows the measurement of how small changes to an algorithm affect the evaluation criteria.
        • Examples
          • Click
          • Purchase probability
          • Revenue of one algorithm vs. revenue of a second algorithm
      • The bottom line for most retailers is typically revenue generated, but there are other goals
        • maximize margin
        • sell into the long tail
        • cross-selling
        • end-user goals
        • optimize revenue
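
A minimal sketch of the online idea: randomly split visitors between two algorithms and compare a bottom-line metric such as revenue per visitor. The assignment scheme and revenue numbers below are made up:

    import random
    import statistics

    def assign_arm(visitor_id, arms=("algorithm_A", "algorithm_B")):
        # Deterministic per-visitor randomization, so a returning visitor
        # keeps seeing the same algorithm.
        return random.Random(visitor_id).choice(arms)

    print(assign_arm("visitor_42"))

    # Hypothetical logged revenue per visitor for each arm.
    revenue = {
        "algorithm_A": [0.0, 12.5, 0.0, 4.0, 33.0],
        "algorithm_B": [5.0, 0.0, 18.0, 9.5, 0.0],
    }

    for arm, values in revenue.items():
        print(arm, "mean revenue per visitor:", statistics.mean(values))

With enough traffic, even small changes to an algorithm show up as measurable differences between the two arms.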
    • Recall HW6

      • Part 3 -- Netflix contest. The online movie rental company Netflix ran a data mining contest asking participants to predict how much a user will enjoy a movie based on his/her previous movie ratings; see http://www.netflixprize.com/. Besides marketing, the hope for Netflix might have been to get ideas on how to improve their algorithm. Please take the following questions as a starting point for a one-page critical evaluation of the Million Dollar prize (which was withheld since none of the participants met the goal of 10% improvement):
      • Semantics of ratings - Movies are not a repeat game. What does a rating of 5.x mean?
      • Was the task set up well, especially,
      • was the evaluation set up appropriately?
      • Discuss "ground truth" and
      • offline vs online evaluation.
      • Netflix has defined their problem correctly and shown what results have been achieved so far. However, the test data is not sufficient in fields. How would you have set this up instead?
      • What were the key learnings?
      • IP protection
      • PR campaign

Models BDA-11

Recommendation Modeling

Relational Modeling
Item-to-item filtering → Full Relational modeling
relationalModeling.JPG


Dynamic Tracking

Behavioral targeting → Dynamic tracking BDA-56
Exploitation / Exploration
Active learning as data collection strategy

dynamicTracking.JPG


Role of time: 2 extreme views

Does the recommender system keep user state?

Statistical Relational Models

Assumptions

    • The observable behavior is shopping and buying.
    • Shopping
      • Building and maintaining product space awareness
        • Product attribute space (choices)
        • Value space (preferences)
      • Multiple sites, multiple sessions and multiple threads
    • The web contains a massive amount of structured observational data about shopping behavior
      • The data is not independent and identically distributed (IID)
      • It can't be exploited using standard IID assumptions
      • Makes statistical relational modeling important

Challenges

    • Large number of dimensions
    • Large domain sizes
      • Many pages at a site
      • Large product catalogs
    • Large number of interesting derived attributes
    • Many ways of accessing the data

Path

    • Joint Probability Distributions
    • Graphical Models: Bayesian Networks
    • Relational Models: Representation and Algorithms
    • Modeling Web Visitor Behavior
    • Generate Recommendations

Joint Probability Distributions

    • A joint probability distribution completely specifies a model
      • Can answer any questions!
    • Example: A door sensor that can detect whether someone is by the door
      Untitled2.png
      • 2 variables - sensor output and state of the world
      • Create joint probability distribution between the two
      • Question: Is anyone there?
        • Prior P(someone there) = 0.0001 + 0.01 = 0.0101
        • Posterior P(someone there | sensor says no) = 0.01 / (0.01 + 0.0099), about 0.50 (see the sketch below)
    • However, as the number of variables in the system grows, the complexity increases exponentially
      • The size of the table is exponential in the number of variables
      • It becomes very difficult to specify the variables, store them, and compute probabilities
      • The solution is Bayesian Networks
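
A sketch of the door-sensor computation using the numbers above; the fourth cell of the joint table is filled in here so that the four probabilities sum to one (an assumption, since the notes give only three numbers):

    # Joint distribution over (someone_there, sensor_says_yes).
    joint = {
        (True, True): 0.0001,
        (True, False): 0.01,
        (False, False): 0.0099,
        (False, True): 0.98,    # assumed: chosen so the table sums to 1
    }

    # Prior P(someone there): marginalize the sensor out.
    prior = sum(p for (there, _), p in joint.items() if there)
    print(prior)      # 0.0001 + 0.01 = 0.0101

    # Posterior P(someone there | sensor says no): condition on sensor == False.
    posterior = joint[(True, False)] / (joint[(True, False)] + joint[(False, False)])
    print(posterior)  # 0.01 / (0.01 + 0.0099), about 0.50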

Bayesian Networks

    • Directed Acyclic Graph (DAG) over a set of variables
    • Arcs between nodes signify the relationship between them
    • Missing arcs in the DAG imply a conditional independence in the probability distribution
    • Allows for a tractable definition of the model
      • Depends on the graph being sparse, which graphs generally are, due to:
        • lack of data
        • human language being formulated in such a way that the resulting graphs are sparse
    • Efficient algorithms for learning the models and inference

For a detailed explanation of Bayesian Networks please see http://en.wikipedia.org/wiki/Bayesian_network.

Explaining Away


Untitled1.png

  • Earthquake or Burglary can cause Alarm to go off
  • Alarm will cause a Phone message to be sent to me
  • Earthquake can also cause a Radio report





  • No arc between Earthquake and Phone means: if you know whether the Alarm went off, then knowing whether an Earthquake happened does not change the probability of getting a Phone message.
  • The missing arcs imply conditional independence in the joint probability distribution.
    • Theorem: If you specify the local conditional probability of each node in the graph given only its parents - a set of local numbers defined by the graph - there is guaranteed to exist a single, unique probability distribution over all variables, and it can be recovered by taking the product of the conditional distributions (see the sketch below).
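
A sketch of the theorem and of explaining away on the alarm network above, with made-up conditional probability numbers; inference here is brute-force enumeration, standing in for the efficient algorithms mentioned earlier:

    from itertools import product

    # Made-up local conditional probabilities for the five-variable network.
    P_e = 0.01                                  # P(Earthquake)
    P_b = 0.01                                  # P(Burglary)
    P_a = {(0, 0): 0.001, (0, 1): 0.9,          # P(Alarm | Earthquake, Burglary)
           (1, 0): 0.3, (1, 1): 0.95}
    P_ph = {0: 0.01, 1: 0.9}                    # P(Phone | Alarm)
    P_r = {0: 0.001, 1: 0.8}                    # P(Radio | Earthquake)

    def bern(p, x):
        return p if x else 1.0 - p

    def joint(e, b, a, ph, r):
        # The theorem above: the unique joint distribution is the product
        # of the local conditionals, each given only its parents.
        return (bern(P_e, e) * bern(P_b, b) * bern(P_a[(e, b)], a)
                * bern(P_ph[a], ph) * bern(P_r[e], r))

    def prob(query, evidence):
        # P(query | evidence) by summing the joint over all worlds.
        num = den = 0.0
        for e, b, a, ph, r in product([0, 1], repeat=5):
            world = {"e": e, "b": b, "a": a, "ph": ph, "r": r}
            if all(world[k] == v for k, v in evidence.items()):
                p = joint(e, b, a, ph, r)
                den += p
                if all(world[k] == v for k, v in query.items()):
                    num += p
        return num / den

    # Diagnosis against the arc direction: burglary given a phone message.
    print(prob({"b": 1}, {"ph": 1}))
    # Explaining away: a radio report of an earthquake lowers that belief.
    print(prob({"b": 1}, {"ph": 1, "r": 1}))

The same prob() call answers prediction, diagnosis, and explaining-away queries; only the query and evidence change, not the model.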

Inference - My car won't start!
Untitled3.png

  • Prediction
    • Will it start?
      • The battery is charged and the fuel gauge says it is full. Given this information, will the car start?
  • The direction of the arcs is independent of the direction of inference
  • Diagnosis
    • What's wrong?
      • My car does not start. What's wrong with it? What are the probabilities that the battery is dead or the ignition system has failed?
  • Decision analysis
    • What should I do about it?


Probabilistic Relational Models

  • Originally, machine learning algorithms used a flat file data model.
    • imposition of IID across the rows and columns of the file
  • Bayesian networks relax the IID assumptions across the variables (columns of the dataset)
  • Probabilistic Relational Models relax the IID assumptions across the rows of the dataset.
    • Example: Three page views in a session log by the same visitor.
      • These are not independent, because they are views by the same visitor.
      • If you extract the visitor out of the model and make it a separate entity in the schema, there is one visitor that all three rows are related to.
      • The three rows are no longer independent, but they are IID conditioned on the visitor (see the sketch below).
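
A minimal sketch of that example with a hypothetical session log: the flat file treats page views as IID rows, while the relational view pulls the visitor out as its own entity:

    from collections import defaultdict

    # Flat file: one row per page view. Treating the rows as IID ignores
    # that three of them belong to the same visitor.
    page_views = [
        {"visitor": "v1", "page": "/cameras"},
        {"visitor": "v1", "page": "/cameras/xxx"},
        {"visitor": "v1", "page": "/checkout"},
        {"visitor": "v2", "page": "/swimwear"},
    ]

    # Relational view: the visitor becomes a separate entity; each page view
    # is related to exactly one visitor, and the rows are IID only
    # conditioned on that visitor.
    by_visitor = defaultdict(list)
    for row in page_views:
        by_visitor[row["visitor"]].append(row["page"])

    print(dict(by_visitor))
    # {'v1': ['/cameras', '/cameras/xxx', '/checkout'], 'v2': ['/swimwear']}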


Recommendation Engine:


Algorithm
Select a subset of items from a given catalog in a given situation that optimizes an objective function. The selected items are then presented to the visitor.
Recommendation 1.0
Collaborative Filtering: Predictions about the interests of an individual are made based on information collected from many individuals (see the sketch below).
Example: Individual A likes to buy German cooking books. The majority of people who like to buy German cooking books also like to buy travel books about Germany. Hence, we predict that individual A would also like to buy travel books about Germany.
Association Rules: Allow you to find rules of the kind: If X, then Y.
Example: If (Stat 252ClassAttendance=Always and Participation=EveryClass and Enthusiasm=High) then (Grade=A).
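
A minimal sketch of item-to-item collaborative filtering in the sense above, with a made-up order history echoing the German-cookbook example: count which items are bought together, then recommend the strongest co-occurring items.

    from collections import Counter
    from itertools import combinations

    # Hypothetical order history: each set is one customer's purchases.
    orders = [
        {"german cookbook", "germany travel guide"},
        {"german cookbook", "germany travel guide", "phrasebook"},
        {"german cookbook", "knife set"},
        {"phrasebook", "germany travel guide"},
    ]

    # Count how often each pair of items appears in the same order.
    pair_counts = Counter()
    for order in orders:
        for a, b in combinations(sorted(order), 2):
            pair_counts[(a, b)] += 1

    def recommend(item, k=2):
        # Rank the other items most often bought together with `item`.
        scores = Counter()
        for (a, b), n in pair_counts.items():
            if a == item:
                scores[b] += n
            elif b == item:
                scores[a] += n
        return [other for other, _ in scores.most_common(k)]

    print(recommend("german cookbook"))
    # ['germany travel guide', ...] - the prediction for individual A above

An association-rule miner would express the same regularity as a rule "If german cookbook, then germany travel guide", qualified by support and confidence thresholds.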
Recommendation 2.0
CleverSet
Assumptions
Architecture of Participation
Web 1.0: Information is pushed to the user by the website.
Web 2.0: Things are now about recommendation algorithms and architectures of user participation.
Transparency, Control and Content
Transparency: Expose the auxiliary information from a participating community that a recommender relies on.
What is the induced metric based on the visitor's behavior?
Control: How much influence does the customer have on the recommendation process and the associated parameters?
Content: Metadata: Annotation, markup, tags, etc.
Key understandings:
1) You don't own all of the data. The web is huge and a lot of additional data resides on the web.
2) The visitor, and not the company, is at the center. Hence, the thinking should be ME-business instead of E-business.
3) A page is a framework for interaction.
4) A page does not equal content, since content is dynamic. (cf. Ajax)
• HW7
Use CleverSet and its API to make recommendations on your site.

Implementation:


The website is deconstructed into pieces running across the internet
History
1) The deconstruction began in early 2000 with the metric companies.
2) Thinking: It was "wasteful" for websites to do metrics themselves since external metric companies could also do them.
3) By putting some JavaScript on the web page, the visitor's web browser could communicate with the metric companies.
4) As a result, IT departments and the CIO no longer had full control of the process.
Present Day
1) Thinking: It is "wasteful" for websites to be writing their own recommendation systems.
2) As with their predecessors (the metric companies), JavaScript is put on the retailer's page / product detail pages so that every click is tracked.
3) Companies like CleverSet then populate various slots on the retailer's page with recommendations.
4) Retailers like the idea since the process is simple to implement.
5) Deployment is now much easier vs. recommendation system deployments in the '90s.
Potential Problems
1) Unlike metrics, the visitor experience can now be adversely affected if the recommendation system goes down.
2) If the retailer does not have a "spot" on the page for the recommendations, the retailer's page needs to be re-designed.
• Previous Recommendation Companies of the '90s
1) Required heavy server-side integration
2) The lengthy six-month integration cycle with IT resulted in their demise.
3) Example: NetPerceptions (2003)

Present Day Risks:


• Catalog
1) Use the empirical distribution = use what people actually go and see => spider the site as a visitor; RSS feeds
2) A recommendation system is of no value for the peak or the tail of the distribution. Its value lies in making recommendations for items found in the "middle" of the distribution.
3) New products in the catalog that have not been spidered cannot be recommended.
4) If the category structure is bad, the implementation can fail.
5) Problems can arise if the product is out of stock.
• Marketing
1) Highly branded companies are concerned about the appropriateness of the recommendations since inappropriate recommendations may tarnish their brand.
Example: Company XYZ does not want to recommend green shirts with pink shorts.
2) Marketers (traditional) are also concerned about losing control of the process.
• IT Department
1) Main concern is service (response time).
2) Does the money come from the IT dept. budget?
3) If the recommendation system goes down, current recommendation companies like Cleverset will substitute old recommendations in the slot.
4) Build or buy issue: IT departments should be buying recommendation systems, just like they buy database systems, etc.
• Privacy
1) Data older than n months should be destroyed since its value for prediction is greatly diminished.
2) Why? Example: An individual from Germany was denied entrance to the US because of the book titles contained in her Amazon.com "wishlist".

For further information on "Recommendation 2.0":
dambrosi@cleverset.com

