Andreas Weigend
Stanford University
Stat 252 and MS&E 238

Data Mining and Electronic Business


Class 2

  • Topics addressed

    • Review of HW1

      • Search through local area second-hand car ads and look for ones satisfying some budget constraint
      • Model the points / number of comments relationship on a social bookmarking site (such as Reddit)
      • Get movie genre information from IMDb, as a step toward attempting the Netflix challenge.
      • Compare the densities of relevant links across different Wikipedia topics.
    • Search

      • Desktop
        • The old way: Search through files sequentially on the hard drive. This becomes more and more difficult as hard drives get larger.
        • The new way: Build an inverted index from terms to files by scanning files for strings or word segments.
        • This is easy for English, where words are naturally segmented by spaces. It is more difficult for languages with no explicit word boundary delimiter, e.g. Chinese, Japanese and Thai.
        • Treat a typo as a new entry at indexing time. Later, usage statistics can reveal whether people who search for one keyword end up looking at results for another.
        • Desktop search software is available from Google, X1, Yahoo and Microsoft. Install one and you will find things you never knew you had on your computer.
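The "new way" of desktop search can be illustrated with a minimal sketch of an inverted index. The file names, the tokenizer and the helper names (`tokenize`, `build_index`, `search`) are assumptions made for illustration; the whitespace tokenizer only suits space-delimited languages, as noted above.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    return [w.strip(".,;:!?\"'()") for w in text.lower().split()]

def build_index(files):
    """Map each term to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for term in tokenize(text):
            if term:
                index[term].add(name)
    return index

def search(index, query):
    """Return the files that contain every term of the query."""
    terms = [t for t in tokenize(query) if t]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical files standing in for the contents of a hard drive.
files = {
    "notes.txt": "Data mining and electronic business",
    "todo.txt": "Buy a used car; check the budget",
    "hw1.txt": "Mining used car ads under a budget",
}
index = build_index(files)
print(sorted(search(index, "used car")))  # ['hw1.txt', 'todo.txt']
```

The index is built once; each query then costs a few set lookups instead of a sequential scan of every file.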
      • Intranet
        • The problem is to search through a network of computers typically within a firewall.
        • You still have the notion of some kind of a file system. So no big difference with desktop search.
      • Web:
        • Major difference with desktop or intranet search:
          • For desktop or intranet search, the paths to files are known;
          • For web search, there is no such path or directory structure; it is replaced by the link structure, which can be explored by a crawler.
            • Related CS issues: Depth/breadth of crawler; How often to update.
        • The key problem is not to find something, but to rank something.
        • Old days search: What happened before there were search engines?
          • 1. Guess or know the location on the web (e.g. name of page, URL)
            • Flaw: Have to know and type the URL into browser correctly
          • 2. Browse using directories
            • Organize manually using expert "surfers" (this is how Yahoo! came into existence)
              • Flaw: Does not scale, manual directories difficult to maintain
            • Organize by community of web users
              • Flaw: Difficult for tagging in the given structure
          • 3. Early search engine: (Infoseek, Lycos, Excite, AltaVista)
            • Crawl through web, following hyperlinks --> Extract words from the page --> Build index of web --> Match user input (search terms) to the index
              • Flaw: Users might not know the exact word or its spelling, or they enter one word while the index uses another.
        • Some stats: March 2007 search engine stats from ComScore
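The early search engine pipeline (crawl, extract words, build an index, match the query) can be sketched over a toy in-memory "web". The `WEB` dictionary, its URLs and the `max_depth` parameter are hypothetical; a real crawler would fetch pages over HTTP and would also have to decide how often to revisit them.

```python
from collections import deque

# Hypothetical toy web: URL -> (page text, outgoing links).
WEB = {
    "a.example": ("used car ads", ["b.example", "c.example"]),
    "b.example": ("car reviews", ["c.example"]),
    "c.example": ("movie genres", ["a.example", "d.example"]),
    "d.example": ("travel deals", []),
}

def crawl(seed, max_depth):
    """Breadth-first crawl from seed, following links up to max_depth hops."""
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        _, links = WEB.get(url, ("", []))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

def build_index(urls):
    """Extract words from each crawled page and map word -> set of URLs."""
    index = {}
    for url in urls:
        text, _ = WEB[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

pages = crawl("a.example", max_depth=1)
index = build_index(pages)
print(pages)                            # ['a.example', 'b.example', 'c.example']
print(sorted(index.get("car", set())))  # ['a.example', 'b.example']
```

With `max_depth=1` the page `d.example` is never reached, which shows how the depth/breadth trade-off of the crawler determines what can be found at all.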
    • Machine Learning: Ranking

      • New Problem: Relevance -- How to rank the page? What to show on top?
        • Improving relevance major effort at search engines today - for example large groups at Google and Yahoo devoted to this
      • Sources of relevance (in organic search results): What information can be used to help with this?
        • Within Page:
          • Location of search term on page
          • Number of occurrences of search term on page
          • Metatags with keywords describing the content of your page
        • Static: Link Structure (e.g. Number of hyperlinks going into page)
          • PageRank is a property of the page that does not depend on the search term. It depends only on the link structure around the page: the number of links coming into it from other pages, which is a measure of the attention expressed toward that page. Search engines try to detect artificially created link farms and discount their links.
          • Other people's links are both intention data and attention data.
          • Leverages other websites: your website is more important if a lot of (important) sites think your website is important (linking to your site)
          • Can be spammed by link farms, becomes a constant cycle of trying to cheat (search engine optimization) and preventing people from cheating (search engines)
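The idea that a page matters more when important pages link to it can be sketched with a simple power iteration. This is a textbook-style simplification, not the algorithm as run by any search engine; the damping factor of 0.85, the iteration count and the four-page link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

In this graph C ranks highest because three pages link to it, and A ranks above B and D because its single incoming link comes from the important page C. The score is computed without reference to any query.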
        • Dynamic: Click Behavior
          • The search summary displayed for each result is what prompts a user to click on it. If a particular link, say the 4th in the search results, is clicked more often than the others, it is moved up in the ranking.
          • Choice within a set of links
          • Collective action of searchers moves results up or down
          • Understand overall trajectory (e.g. for typos)
            • Related Question: What information does the user see?
          • Leverages users' attention data
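How the collective action of searchers can move results up or down is sketched below by re-ordering results according to a smoothed click-through rate. The function name, the smoothing constants and the click counts are hypothetical, and the sketch ignores position bias (higher positions attract clicks regardless of quality), which a production system would need to correct for.

```python
def rerank_by_clicks(results, impressions, clicks,
                     prior_clicks=1.0, prior_impressions=10.0):
    """Order results by smoothed click-through rate, highest first.

    The priors keep results with very few impressions from jumping
    to the top on the strength of one or two clicks.
    """
    def smoothed_ctr(url):
        shown = impressions.get(url, 0) + prior_impressions
        clicked = clicks.get(url, 0) + prior_clicks
        return clicked / shown

    # sorted() is stable, so ties keep their original order.
    return sorted(results, key=smoothed_ctr, reverse=True)

# Hypothetical log: the 4th result is clicked more often than the others.
results = ["p1", "p2", "p3", "p4"]
impressions = {"p1": 1000, "p2": 1000, "p3": 1000, "p4": 1000}
clicks = {"p1": 300, "p2": 150, "p3": 100, "p4": 400}
print(rerank_by_clicks(results, impressions, clicks))  # ['p4', 'p1', 'p2', 'p3']
```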
      • Interesting facts:
        • One-third of all sessions at Amazon.com are single-hit sessions, but those sessions account for only 1% to 2% of the total clicks that Amazon.com receives. So in the online world, clicks are more important than the notion of visitors.
        • When the travel reservation company PrecisionReservations.com changed its domain name to Agoda.com, the number of hits it received went down from tens of thousands a day to a few hundred a day. PrecisionReservations usually appeared at the top of the search result pages, but Agoda.com was nowhere to be found. This is an example of how important search engines are in the online business ecosystem.
        • You can get the older versions of a website using Archive.org
        • The number of web searches per day is on the order of a billion. The number of emails per day is also on the order of a billion. In China alone, the number of SMS messages per day is on the order of a billion.
    • Vertical Search

      • Vertical search extracts information from the so-called "deep web", the underlying databases of websites, and makes it available to users. For instance, the prices of a particular digital camera model won't show up in a usual web search; finding them requires a vertical search.
      • Vertical searching is relevant in the following fields:
        • Insurance policies
        • Shopping comparison
        • Real Estate
        • Cars
        • Travel
    • Monetization: Advertising

      • Why advertising?
        • Consumer Surplus
        • Connect merchants with prospective customers
        • Monetize consumer services
      • Why online advertising?
        • Worldwide advertising is on the order of 400 billion dollars, while online advertising is only on the order of 10 billion, even though 30% of people's time is spent online.
        • Yet online gets only 4% of advertising spending
        • The biggest advantage for online advertising is its measurability. Advertisers can measure how effective their ad campaigns are and determine their ROI, whereas with traditional media (TV, radio, newspapers and magazines) the ad was mass circulated but there was no way of knowing how effective it was.
        • Audience targeting is another advantage online advertising has over traditional advertising. Traditional media can do only limited audience segmentation (e.g., car magazines), whereas online advertising has the potential to target users by demographics, geography and behavior.
        • As more and more content becomes digitized, online advertising will move over to books and magazines. Cell phones and video game consoles are just starting to explore online advertising. Software funded by advertising looks like a possibility now.
        • Self-Selection
          • Web 1.0 - push advertising (e.g. banner ads)
          • Web 2.0 - pull monetization based on users' revealed interests, preferences
            • Users actively seek out material of interest
        • Business Models:
          • Pay upon transaction -- CPA: Cost per action (purchase, transaction, registration)
          • Pay when clicked -- CPC: Cost per click (Easiest to verify)
          • Pay by impression -- CPM: Cost per mille (French word for "thousand") impressions
          • Rent the space for a flat fee (fixed amount)
        • There are broadly two different market segments where these pricing schemes become more relevant:
          • Display ads business (banner ads). This is where the big brand advertisers come into play. The primary pricing scheme is eCPM (effective CPM), where CPC and CPA are converted to an equivalent CPM. This is not an auction-driven market; it is a guaranteed business in which makegoods and bonuses are common. Contextual advertising also falls in this category. The publisher has to determine the pricing strategy for its assets. Pricing is complicated because, along with network location and historical sell-through rates, one has to account for the targeting options selected, discounts offered to premium customers, etc.
          • Search or text advertising. This is primarily auction driven, so the burden of setting the best price shifts to the advertisers. It is primarily a self-serve advertising model in which small advertisers play as much of a part as big brand advertisers.
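The conversion of CPC and CPA prices into an equivalent CPM can be sketched as expected revenue per thousand impressions. The formulas below are the commonly used definitions rather than anything stated in class, and the prices, click-through rate and conversion rate are made-up numbers.

```python
def ecpm_from_cpc(cpc, ctr):
    """Expected revenue per 1000 impressions when the advertiser pays per click.

    ctr is the click-through rate: clicks divided by impressions.
    """
    return cpc * ctr * 1000.0

def ecpm_from_cpa(cpa, ctr, conversion_rate):
    """Expected revenue per 1000 impressions when the advertiser pays per action.

    conversion_rate is actions divided by clicks.
    """
    return cpa * ctr * conversion_rate * 1000.0

# Hypothetical offers for the same ad slot, all expressed as eCPM in dollars.
offers = {
    "cpm_deal": 4.00,
    "cpc_deal": ecpm_from_cpc(cpc=0.50, ctr=0.01),
    "cpa_deal": ecpm_from_cpa(cpa=20.0, ctr=0.01, conversion_rate=0.03),
}
best = max(offers, key=offers.get)
print(best)
```

Putting all three pricing schemes on one scale lets a publisher compare them for a single slot. The comparison is only as good as the estimates of click-through and conversion rates.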
      • Google AdWords
        • Decisions to be made by advertiser
        • Decisions to be made by publisher
        • Why does it work?
          • Fine-grained targeting
          • Measurable user behavior
          • Transparency and feedback
      • Economics: Keywords/Auctions
        • Second price auction
        • The burden of determining how much to pay for keywords is shifted to the advertisers.
        • Decisions to be made by advertisers
          • Language and geography in which the ad is shown
          • Text of the ad
          • Keywords to buy
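A second-price auction for a single ad slot can be sketched as follows: the highest bidder wins but pays the second-highest bid. The bidder names, bids and the `reserve` parameter are hypothetical, and real keyword auctions sell several slots and may weight bids by ad quality, which this sketch leaves out.

```python
def second_price_auction(bids, reserve=0.0):
    """Return (winner, price) for a single-slot second-price auction.

    bids maps advertiser -> bid for the keyword. The winner pays the
    second-highest bid, or the reserve price if there is no other bidder.
    """
    if not bids:
        raise ValueError("at least one bid is required")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else reserve
    return winner, price

# Hypothetical bids on one keyword, in dollars per click.
bids = {"alice": 2.50, "bob": 1.75, "carol": 0.90}
print(second_price_auction(bids))  # ('alice', 1.75)
```

Because the price paid does not depend on the winner's own bid, an advertiser has little reason to bid below what a click is actually worth to them.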

    • Social Search (by Jan Pedersen - Chief Scientist, Yahoo! Search and Ads)

      • Definition
        • Wikipedia: search experience where relevance affected by audience input
        • Jan Pedersen: really about content generation - getting users to submit content that's available for searching
      • Category
        • Broadcasting Search (like Yahoo! Answers)
        • Target Search
      • Yahoo! Answers
        • How it works
          • Users can ask, answer a question, or browse questions and answers
          • Those asking questions can rate answers
          • Points earned for answering or rating questions: gives site a gaming aspect and an immediate "value proposition"
      • Strategic importance
        • Experience in Korea (Naver) and Taiwan (Yahoo) shows a strong correlation between knowledge search use and web search use
          • Yahoo's big bet is that this experience repeats in the US
        • US Web Search Share
        • Websearch Industry Landscape
          • Two Major Levers to Gain Share: Superior quality and distribution
          • Mechanisms to Lock-in Share: Differentiated content, superior monetization
          • In-Country Leader Has Advantage: US (G), KR (Naver), CN (Baidu)
          • Significant Barrier to Entry for New Players
          • Brand and User Habits Dominate: Users extract utility from existing services
      • Yahoo! Answers: A high-potential game changer
        • Consumer Need: Create the world's largest platform enabling communities of people to share experiential and tail knowledge
        • Competition: MSN Q&A, LinkedIn, Amazon Askville, Naver, Baidu
        • Game Changing Potential:
          • Create a critical mass of content and a community of knowledge enthusiasts
          • Change the search experience
      • Direction
        • Where will this lead? Examples.
          • DJs ask for Answers as part of their show programming
          • Local listings contain Answers-level credentials
          • Celebs, thought leaders, give out urls for more on their views & knowledge
          • Professionals quote Answers levels as credentials
          • Pop culture mentions Answers frequently and in the context of broader issues – not as a promotion
          • Answers becomes the epicenter of social debate and fact exchange around key issues & political events
        • Why is this game changing?
          • Person with questions: ask answers
          • Person with knowledge: share knowledge
          • Connection between ask and share: family/friends, google/yahoo <=> Wikipedia, myspace, blog...
        • Two Learnings
          • Expertise is in the tail. Example: top answers in pregnancy & parenting
          • It's about the people.
        • Social Networks
          • Build your knowledge network by connecting with the people you trust and the topics you care about
          • Benefits:
            • a more personal experience
            • a more productive experience
            • (faster, easier access) to more useful, helpful, relevant information
        • Tail content and Micro-Communities
      • Open questions on Social Search
        • Will this prove an attractive option for most searchers?
          • For a given user, most questions not that interesting, most opportunities to answer outside domains of expertise
          • For at least a segment of users, the answer seems to be "yes": Yahoo Answers is successful so far
        • Who owns the content?
          • Contributor owns content, website (e.g., Yahoo) owns representation (like Yellow Pages?)
          • Some sort of Creative Commons arrangement (like Flickr?)
          • Public Domain (we all own it?)
        • Why does one community succeed, another one fail?
        • How important are the initial conditions?
        • What about the wrong answer?
  • Insights gained

    • Search is the starting point of everything. The Agoda.com example showed that if you are not indexed by a search engine, you don't exist on the internet. This explains why Google is such a dominant force on the internet today.
    • Searching as an activity is rooted in the intention economy, but the addition of attention data (the PageRank algorithm) was a game changer
    • Ranking is a very important factor in web search; it is almost impossible to search without ranking. Statistical methods and machine learning are widely used to evaluate the relevance of pages and links. In the real world, getting a web page to the top of the rankings is not simply luck. A webmaster who wants a good search engine ranking has a lot to do. First, the content should be of high quality and updated regularly. The content should be rich in keywords, so first research which keywords prospective customers will use when searching, then develop content accordingly. This can help search engine ranking immensely. But never use content that focuses only on keywords and has no substance; strike a good balance between the two.[*]
    • Online advertising is a recent phenomenon. It is growing with, and facilitating, the development of internet search. Prices of web-based advertising space depend on the "relevance" of the surrounding web content and the traffic that the website receives. More efficient methods are needed to price online advertising.
    • The biggest advantage of online advertising is its measurability. Advertisers can measure how effective their ad campaigns are in almost real time (daily or hourly feedback), whereas with traditional advertising media (TV, radio, newspapers and magazines) it is difficult to know how effective an ad was and the feedback cycle is extremely long.
    • Social search changes the nature of searching - now interactive
      • Leverages collective intelligence, enables search to answer questions beyond those within the competence of algorithmic search
      • Can create value for both users and search provider if the model is right
      • All this depends on "quality" content being identified and bubbled up
  • Summary of links to resources and examples mentioned in class

  • Further Readings

  • Homework

  • Initial contributors