Andreas Weigend
Stanford University
Stat 252 and MS&E 238

Data Mining and Electronic Business


Class 4 (April 30): Guest lecturer Bill Tancer, GM of Research, Hitwise

  • Summary of the Class

    • Hitwise is a company that provides aggregated data on internet usage to its clients
    • Data is collected from ISPs and an opt-in panel
    • This dataset can be used for:
      • Correcting market inefficiencies
      • Gathering information about competitors
      • Following up on advertising campaigns
      • Predicting trends in a number of domains
    • The goal of this week's project is to use the Hitwise data to produce a measure of consumer confidence
  • Hitwise Methodology

    • Market research through web usage data
    • 12,000+ clients worldwide
    • 10 million users in the U.S., 25 million worldwide
      • Of the 10 million US users, 7.5 million come via ISPs
      • The other 2.5 million come from an opt-in panel
    • Opt-in panel information is much more thorough
      • More demographic detail is available for panel members: age, gender, income, etc.
      • Cross-referenced with the 7.5 million ISP users
    • Leverages Claritas' PRIZM to segment the user base demographically
      • Based on ZIP+4 information (cluster of several households)
      • Example: "Shotguns and Pickups"
      • Useful for predicting new internet trends (e.g. early adoption by the "technorati")
    • Competitors:
      • Differentiated by methodology:
        • Panel-centric
          • 10,000 to 2 million users
          • Information on top sites only
          • Primarily for advertising decisions
          • Report monthly
          • Examples :
        • Site-centric
          • Monitors traffic to a specific site
          • Marketing and technical information
          • No comparative data
          • Report Daily, sometimes hourly
          • Examples:
        • Network-centric
          • Information on range of customers
          • Report Daily
          • Examples
            • Hitwise
    • Hitwise tries to combine all three models to have the depth and breadth to match up with competitors
  • Examples of Data Mining for Insight

    • Search Term Overview
      • Top search terms tend to be brand names and navigational terms
        • Queries containing “myspace” accounted for ranks 1, 2, 6, 8, and 9 of the top 20 search terms in the chart shown during class
        • Many of the top search terms are domain names
          • Rank 6: “www.myspace.com”
          • Rank 14: “www.yahoo.com”
      • Top 1000 search terms follow “Zipf’s law”
        • The frequency of a search term is inversely proportional to its rank in the frequency table
        • This suggests that the distribution of search term usage resembles the distribution of word usage in general
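One way to check the Zipf's law claim above is to fit a straight line to log(frequency) against log(rank): a slope near -1 is consistent with frequency being inversely proportional to rank. The following is a minimal sketch only. The counts are made up for illustration (they are not Hitwise data), and `zipf_slope` is a hypothetical helper name.

```python
import math

def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank)."""
    ordered = sorted(counts, reverse=True)          # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Illustrative counts that follow frequency = 12000 / rank exactly.
counts = [12000 / rank for rank in range(1, 21)]
slope = zipf_slope(counts)
print(round(slope, 3))  # a value close to -1 is consistent with Zipf's law
```

With real search-term counts the slope will not be exactly -1, so how far it deviates is itself informative.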
    • Seasonality
      • Time series of search frequencies can be surprising and reveal inefficiencies in the marketplace
        • E.g. "prom dress" searches peak in early January (rather than the expected March/April)
          • This told clothing stores to start their displays earlier
          • The effect was likely caused by teen magazines' decision to prolong the prom dress advertising season
          • The January peak falls within the "Lifestyle - Fashion" category, while the March peak falls within the "Shopping & Classifieds - Department Store" category
      • Consumers generally reveal their interests online before in stores
        • E.g. 10x spike in "engagement rings" searches in the week before Thanksgiving, while jewelers say they get most of their business between Thanksgiving and Christmas
      • Year-to-year comparisons reveal striking similarity in "market share of visit patterns" between years -- enables good predictions for future years
        • E.g. Interest in dieting peaks January 1 (New Year's resolutions...). Low point: Thanksgiving
        • However, there appears to be a drop in wedding dresses and engagement rings this year...
      • Reality TV and advertising
        • "Google Pontiac" ad
          • Led to clear increase in Google searches for "Pontiac," but 15% of users visited Mazda's comparison site
          • Interesting to note that people follow directions -- no difference in Yahoo! searches for "Pontiac" (similar observation for "Yahoo Special K")
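The year-to-year similarity noted under Seasonality suggests a very simple baseline: use last year's value for the same period as this year's forecast, then measure how far off it is. The sketch below assumes monthly visit shares. The numbers are invented for illustration, and the helper names `peak_period` and `seasonal_naive_error` are hypothetical.

```python
def peak_period(series):
    """Index of the largest value (0 = first period of the year)."""
    return max(range(len(series)), key=lambda i: series[i])

def seasonal_naive_error(last_year, this_year):
    """Mean absolute percentage error when last year's value for the
    same period is used as the forecast for this year."""
    errors = [abs(actual - forecast) / actual
              for forecast, actual in zip(last_year, this_year)]
    return sum(errors) / len(errors)

# Made-up monthly visit shares (Jan..Dec) for a dieting category.
last_year = [10, 7, 6, 6, 7, 6, 5, 5, 5, 4, 3, 4]
this_year = [11, 7, 6, 7, 7, 6, 5, 5, 4, 4, 3, 4]

print(peak_period(this_year))                        # month index of the peak
print(round(seasonal_naive_error(last_year, this_year), 3))
```

A small error means the previous year is a good predictor. A sudden large error (such as the drop in wedding-related interest mentioned above) flags a year that breaks the pattern.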
    • Negatively correlated word pairs
      • "online poker" and "sports book" -- suggests fixed quantity of gamblers distributed between two games
      • "boots" and "sandals" -- seasonal effect
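Negatively correlated pairs like those above can be found by computing the Pearson correlation between two search-share time series. This is a minimal sketch with invented monthly values, not Hitwise data. Note that the coefficient alone cannot tell a substitution effect (poker vs. sports book) from a seasonal one (boots vs. sandals).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up monthly search shares (Jan..Dec), for illustration only.
boots = [9, 8, 6, 4, 2, 1, 1, 2, 4, 7, 9, 10]
sandals = [1, 2, 3, 5, 8, 10, 10, 8, 5, 3, 2, 1]

r = pearson(boots, sandals)
print(round(r, 2))  # strongly negative for these illustrative series
```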
    • More examples in economics
      • Online behavior can provide insight into consumer sentiment before leading indicators
        • E.g. Unemployment indicators
          • The unemployment claims number published monthly by the Department of Labor takes two weeks to collect and compile for release
          • Hitwise can track immediate changes in visits to unemployment sites or unemployment-related search terms
            • Possible to predict whether unemployment will increase or decrease with around 85% accuracy
      • Hitwise data can highlight discrepancies between theoretical expectations and actual consumer behavior
        • E.g. Gas prices
          • Gas prices account for roughly 6-8% of the household budget, so spikes in gas prices aren’t theoretically expected to seriously affect consumer spending; instead, spending would be expected to shift online
          • Hitwise data demonstrates otherwise: when gas prices go up, online retail traffic decreases, and when gas prices go down, online retail traffic increases
          • Gas prices don't just have a substitution effect; they also affect consumer sentiment. When gas prices hit a high, people Google the gas price, which suggests it has symbolic importance to them.
          • The negative correlation with gas prices fails to hold for certain luxury retail sites
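The "increase or decrease" prediction described for unemployment can be scored as directional accuracy: how often the web signal moves in the same direction as the official number. The sketch below uses invented series, and the function names are hypothetical. It does not reproduce how Hitwise arrived at its roughly 85% figure.

```python
def direction(series):
    """Sign of each period-over-period change: +1 up, -1 down, 0 flat."""
    return [(b > a) - (b < a) for a, b in zip(series, series[1:])]

def directional_accuracy(signal, target):
    """Fraction of periods in which signal and target move the same way."""
    pairs = list(zip(direction(signal), direction(target)))
    hits = sum(1 for s, t in pairs if s == t)
    return hits / len(pairs)

# Made-up numbers for illustration only.
visits_share = [1.0, 1.2, 1.1, 1.3, 1.6, 1.5, 1.4, 1.7]   # web signal
claims = [300, 310, 305, 312, 330, 333, 320, 340]         # official figure

acc = directional_accuracy(visits_share, claims)
print(round(acc, 3))
```

In practice the web signal would be shifted earlier in time than the official release, since the point is that it is available first.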
    • Hitwise data must be interpreted carefully, especially when trying to determine user intent
      • E.g. Predicting reality TV winners
        • Stacy Keibler Effect
          • Stacy Keibler was a final contestant for the show Dancing with the Stars in 2006
          • Searches for Stacy Keibler occurred 10 times more frequently than those for her primary competitors, indicating she might win the phone-in popularity contest
          • Instead of winning first place, Stacy Keibler placed third
          • This miscalculation led to the creation of the “Stacy Keibler Correction Coefficient” (SKCC)
            • Search terms should be augmented with other information to judge user intent -- a large proportion of Stacy Keibler searches were performed primarily for pictures (e.g. nude pics)
        • American Idol season 5
          • Taylor Hicks searches were more focused on his musical ability than searches for Katharine McPhee, confirming Taylor Hicks’ lead
      • E.g. Home sales
        • Home sales decreased despite the increase in search terms like “homes for sale”
        • The search term increase was attributed to “vanity searches” -- people were concerned about estimating their home’s value, not looking for new homes to buy
      • E.g. YouTube age distribution
        • Initial results showed that YouTube was dominated by older users
        • These results were gathered during Spring Break, when college students were using their parents' computers!
    • Hitwise provides useful insights about the nature of Web 2.0
      • Is Web 2.0 mainstream?
        • 2 years ago 2% of internet users visited Web 2.0 sites, today 12% do
        • Wikipedia visits outnumber Encarta visits 3400:1
      • Are Web 2.0 users mostly passive, or do they actively create content?
        • 0.16% of YouTube visits are uploads
        • 0.2% of Flickr visits are uploads
        • 4.56% of Wikipedia visits are edits; it is easier to edit text than to create new content
      • What is the age distribution of Web 2.0 participants?
        • Older people participate more relative to younger people
          • 26% of 18-24 year-olds only view Wikipedia, while only 4% both view and edit it
          • 19% of 45-54 year-olds only view Wikipedia, while 29% both view and edit it
          • 30.55% of 18-24 year-olds only view YouTube, while 2% both view and upload
          • 17% of 45-54 year-olds only view YouTube, while 30.56% both view and upload
      • Can we predict Web 2.0 winners?
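The age comparison above becomes easier to read as a contributor rate: of the people in an age group who use a site at all, what fraction also contribute? The sketch below uses the percentages from these notes and assumes they can be read literally as shares of each age group, which the notes do not confirm. `contributor_rate` is a hypothetical helper.

```python
# Figures from the notes: (percent view only, percent view and contribute).
# Assumption: both percentages are shares of the same age group.
wikipedia = {"18-24": (26.0, 4.0), "45-54": (19.0, 29.0)}
youtube = {"18-24": (30.55, 2.0), "45-54": (17.0, 30.56)}

def contributor_rate(view_only, view_and_contribute):
    """Contributors as a fraction of everyone in the group who uses the site."""
    return view_and_contribute / (view_only + view_and_contribute)

for site, groups in (("Wikipedia", wikipedia), ("YouTube", youtube)):
    for age, (view_only, both) in groups.items():
        rate = contributor_rate(view_only, both)
        print(f"{site} {age}: {rate:.1%}")
```

Under this reading, the older group's contributor rate is several times the younger group's on both sites.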
  • Using Hitwise

    • Chart
      • Compare Websites: measures percentage of internet usage by specific site
      • Compare Industries: measures percentage of internet usage by "industry"
        • Example: "Automotive - Manufacturers" includes websites of GM, Honda, etc.
      • Compare Search Terms: measures share of a search term driving traffic to all web sites
    • Profile
      • Website Details: detailed analysis of many popular websites
      • Rankings: rank of a website amongst competitors
        • Shows movement within a category
      • Clickstream: where users visited before and after visiting a specific site or sites within a category
      • Search Terms: search keywords driving traffic to specific sites or sites within a category
      • Demographics: breakdown of visitors to a site or category by sex, income, age, location
      • Lifestyle (not available to us): breakdown of visitors by lifestyle (see PRIZM)
    • Search Intelligence
      • Search Term Variations: shows common search term variations of a particular keyword or phrase
      • Search Term Analysis: breakdown of a search term by:
        • Popular websites receiving traffic
        • Share of total internet searches
        • Related search terms
        • Search engine share
        • Industry traffic share
      • Search Term Gap Analysis: compares differences in keyword share between sites or industries
      • Search Engine Analysis: market share of top search engines
      • Search Term Portfolios: analysis of search terms within a custom portfolio
    • Find
      • Websites by Description: Looks for sites that include specific terms in their URL or site description
      • Websites by URL: Looks for sites that include specific words in their URL
      • Websites by Demographic Composition: Looks for sites popular within a specific demographic category
      • Media Mentions by Keyword: Looks for sites recently mentioned in the media by specific keywords within articles
  • Project

    • Make a proposal for a new index of consumer confidence based on Hitwise data
    • Currently used: Conference Board CCI, Michigan Consumer Sentiment Index, ABC News Consumer Comfort Index
    • The CCI is an index benchmarked to 1985 (1985 = 100)
    • Calculated from a survey of 5,000 people who are asked to rate the following topics as positive, negative, or neutral:
      1. Business conditions now
      2. Business conditions in 6 months
      3. Employment conditions now
      4. Employment conditions in 6 months
      5. Family income in 6 months
    • Why we can do better than the CCI
      • The CCI will overrepresent old people
      • People suffer from cognitive dissonance
      • Survey data is less reliable than observed data
        • 16% of internet traffic goes to adult sites, but fewer people claim to visit adult sites
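As a starting point for the project, one possible shape for a Hitwise-based index is sketched below: pick a few visit-share components, express each relative to a base period, and scale so that the base period equals 100 (mirroring how the CCI is benchmarked to 1985). Everything here is an assumption for illustration: the component names, their signs, the equal weighting, and the numbers.

```python
def confidence_index(shares, base, signs):
    """Average of component ratios to the base period, scaled so base = 100."""
    total = 0.0
    for name, sign in signs.items():
        ratio = shares[name] / base[name]
        # A component with sign -1 (e.g. unemployment-related visits)
        # should lower the index when it rises, so its ratio is inverted.
        total += ratio if sign > 0 else 1.0 / ratio
    return 100.0 * total / len(signs)

# Hypothetical components: +1 rises with confidence, -1 falls with it.
signs = {"luxury_retail": +1, "travel": +1, "unemployment": -1}
base = {"luxury_retail": 2.0, "travel": 4.0, "unemployment": 1.0}
now = {"luxury_retail": 2.2, "travel": 4.0, "unemployment": 1.25}

index_now = confidence_index(now, base, signs)
print(round(index_now, 1))
```

Equal weights are the simplest choice. A real proposal would need to justify the components and weights, for example by checking them against past values of the existing indices.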
  • Insights Gained

    • Can use data to make useful predictions
    • Search term data reflects what’s on people’s minds, providing valuable insights that are difficult to obtain using surveys or other conventional means (in particular, observing what people actually do is much more effective than merely asking them what they do or believe)
    • Nevertheless, search term frequency alone does not tell the whole story behind user intent; we should augment search term frequency data with other data to help pinpoint the intent behind a search term query
      • Analyze co-occurrences of popular search terms with other terms (e.g. “Stacy Keibler” and “pics”)
      • Analyze clickstream data to see which sites people visit based on their search term queries
    • People search online before they buy (e.g. engagement rings) -- hence, search data can be valuable for making short-term predictions
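The co-occurrence idea above (the notes' "Stacy Keibler" and "pics" example) can be sketched as an intent-adjusted share: count only the queries that mention a name without a term signalling some other intent. The query counts, the off-intent term list, and the name "contestant b" are all invented for illustration. This is not the actual "Stacy Keibler Correction Coefficient".

```python
def intent_adjusted_share(queries, name, off_intent_terms=()):
    """Share of total query volume that mentions `name` and contains
    none of the off-intent terms."""
    total = sum(queries.values())
    kept = 0
    for query, count in queries.items():
        words = query.lower().split()
        if name in query.lower() and not any(t in words for t in off_intent_terms):
            kept += count
    return kept / total

# Made-up query counts, for illustration only.
queries = {
    "stacy keibler": 300,
    "stacy keibler pics": 600,
    "stacy keibler dancing": 100,
    "contestant b": 80,
    "contestant b dancing": 120,
}

raw = intent_adjusted_share(queries, "stacy keibler")
adjusted = intent_adjusted_share(queries, "stacy keibler", {"pics", "photos"})
print(round(raw, 3), round(adjusted, 3))
```

Clickstream data could refine this further by checking which sites people actually visit after each query.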

    External Links

Hitwise