Andreas Weigend
Stanford University
Stat 252 and MS&E 238

Data Mining and Electronic Business

Thomson Research Presentation

Views of Text Mining

  • Analogy with data mining
    • Discovering interesting patterns in text data
  • Precursor to data mining
    • Convert text data into relational data to be mined
  • Text mining is distinct
    • Creation of metadata universe (metaverse) that enables textual discovery of relationships, similarities, and trends

Professional Information Needs

  • Need for high quality search results
  • Need to collect information from multiple sources
  • Gathered under pressure of time
  • Used to support major decisions

Legal Publishing Domain Example

  • Tens of millions of documents
    • 7 million US appellate cases, 20 million abstracts
  • Millions of actors and other entities
    • Attorneys, judges, companies, organizations, courts, etc
  • Hundreds of thousands of users
    • Interested in completeness, demand high recall
  • Tendency to neglect secondary sources
    • Online searching has encouraged "fishing expeditions" in case law databases at the expense of secondary sources
    • Changed over the last decade
      • Used to begin with research using libraries, following references to gather information (Unspoken pact between publishers and law firms that this was the way research should be performed)
      • Now people dive right into the case at hand

Case Studies

  • People and organizations in the law
    • Web of relationships exists outside of conceptual universe
      • Legal concepts are well covered by taxonomies
    • Relationships and trends need to be linked to legal events
      • Every datum must have a document behind it
  • "How can users search all of Westlaw?" (Westlaw is Thomson's legal portal)
    • People were tending to dive right into case law instead of beginning with background research. The problem was figuring out how to get them to read more broadly.
    • Thousands of logical databases
    • Computationally expensive to search all full-text indexes
    • Might not work well, due to terminology differences
      • Languages of cases vs. language of statutes

Case Study 1: Firm360

  • Market analytics for law firms
  • Example: Microsoft (note the pie chart showing its litigation divided into types)
    • Reports on who represents whom
      • Attorneys, law firms, expert witnesses
    • By company, by industry, by type of case
      • E.g., "Who handles eBay's IP litigation?"
    • Intuitive graphical display and interface
      • Shows current market share, trends
    • Links to relevant case law documents
      • Also legal directory entries for attorneys
  • How Firm360 Works
    • Processing pipeline and Caravan
      • Runs constantly and takes in about 500 new cases each day
        • Cases are obtained directly from the courts. They are not owned by Thomson, but the metadata generated from them is.

    • Auto-extraction from "front matter" of cases. Front matter is the title of the case and other identifying characteristics.
      • Party names especially companies
      • Attorneys, roles
      • Courts, judges, dates, jurisdictions
    • Automatic categorization of entities
      • Companies by industry
      • Cases by type of suit
      • Attorneys by practice area
    • Linking both to and from documents
      • E.g., find every case an attorney has been involved in
    • Facts and Figures
      • Over 500,000 attorneys
      • ~10,000 judges
      • 153,000 law firms
      • 58,000 companies

ResultsPlus Algorithm

  • Document recommendation systems to both broaden and focus searches
    • Draws upon wide range of materials
    • 10% of revenue comes from recommended documents
  • Improves quality and aids browsing
    • Goes beyond keyword search, leverages metadata
  • Highly successful in terms of both user satisfaction and revenues
    • Incorporates personalized ranking scheme
  • Recommender Systems (recap)
    • Rely upon history of user purchases and possibly ratings
    • Leverage product metadata
    • Quickly gain trust and credibility
    • No added performance overhead
  • Cold Start problem
    • What if you don't have much history?
      • Even in the best of times, user x product matrix is very sparse
    • What if you don't have much product metadata?
      • Long tail products will have less metadata
    • In the case of Westlaw, no usage records had been kept, so there was no starting data to work with. Solved by starting with metadata and shifting the algorithms toward user data as it became available.
  • Attempts at problem formulation
    • "People aren't using online analytical materials enough"
      • Materials need to cite the latest cases (at first they were not being updated fast enough)
      • Need to route new cases to sections of legal encyclopedia
    • "How can users search all of Westlaw (Thousands of databases)?"
      • Computationally expensive to search all full-text indexes
      • May not have good results due to terminology differences
        • E.g., language of cases vs language of statutes
  • ResultsPlus Solution
    • Recommend analytical materials next to case law searches
      • Initial version relied mostly on document metadata
      • Current version also uses click-through data to re-rank
    • Analytical materials already indexed by proprietary document classification process
      • Primarily, keyword and word pair (co-occurrence) index
      • Secondarily, uses any other taxonomic information available
    • Due to limited screen real estate, ranking is crucial
      • Initial version used ranking SVM on publication CTRs
      • Current version combines individual and global CTRs
  • Gathering Pool of Documents/Training Corpus
    • Trained on a subset of text from analytic documents (e.g., the summary paragraphs of an encyclopedia article), together with metadata from the summaries of cases cited in the document
    • Indexed using term co-occurrences (pairs of terms); single terms are treated as paired with themselves
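The term-pair indexing described above can be sketched as follows; the document IDs and terms are invented for illustration, and the structure is an assumption based on the notes rather than Thomson's actual index:

```python
from collections import defaultdict
from itertools import combinations

def index_document(doc_id, terms, index):
    """Index a document under every unordered term pair; a single
    term is treated as paired with itself."""
    unique = sorted(set(terms))
    for t in unique:
        index[(t, t)].add(doc_id)          # self-pair for single terms
    for a, b in combinations(unique, 2):   # all unordered term pairs
        index[(a, b)].add(doc_id)

index = defaultdict(set)
index_document("doc1", ["patent", "infringement", "damages"], index)
index_document("doc2", ["patent", "license"], index)

# Documents containing both "infringement" and "patent":
print(index[("infringement", "patent")])  # {'doc1'}
```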
  • Results Fusion
    • Need to combine results from multiple sources
    • A threshold is set on the scores from each publication, giving every publication at least a chance of being shown (if its article scores well enough)
    • The original method used an SVM on the click-through rate per publication
      • Click-through decisions matter because documents may cost money to view (some are free depending on the user's plan), so a click signals how useful a publication is
    • Current method: takes advantage of all available information on both individual clients (e.g., an attorney's main practice area) and groups of clients (e.g., a law firm)
      • Quite a bit is known about each client, such as jurisdiction and main practice area, as well as their personal history and preferences
    • Grouping information allows backing off: if there is not enough information for an individual client, the next higher level (e.g., the law firm) can be used
      • User to law firm to practice area to general user
      • Publication to publication type, etc.
    • With more data around users, the system can be smarter about how that data is leveraged
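The level-by-level fallback described above can be sketched as follows; the publication name, counts, and the impression threshold are hypothetical:

```python
def backoff_ctr(pub, levels, min_impressions=50):
    """CTR estimate for a publication, taken from the most specific
    level (user -> law firm -> practice area -> global) that has
    enough impressions to be trusted."""
    for stats in levels:                     # most to least specific
        clicks, impressions = stats.get(pub, (0, 0))
        if impressions >= min_impressions:
            return clicks / impressions
    return 0.0                               # no level has enough data

user_stats = {"ALR": (2, 10)}        # too few impressions at user level
firm_stats = {"ALR": (30, 200)}      # firm level has enough data
global_stats = {"ALR": (1000, 40000)}

print(backoff_ctr("ALR", [user_stats, firm_stats, global_stats]))  # 0.15
```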
  • ResultsPlus: Use of Ranking SVM
    • The SVM seeks to minimize inverted document pairs in rankings
      • Each document is represented as a feature vector (various click-through rates)
      • Various types of information are available, but only 10 features in total
    • Pair-wise training model:
      • Training data is based on randomly generated document samples
      • Rank documents with respect to their click through rates
      • Identify 'invertible' pairs of documents, i.e., pairs of documents with different ranks
    • Generating the Training Samples:
      • Present randomized results to users
      • Randomness is important; otherwise the data is skewed, biased toward top positions
      • Done over a period of time on a small subset of users
  • SVM Method
    • Converted into a binary decision problem that the SVM can handle
    • Input is set of labeled difference vectors
    • Uses a linear kernel so features have to be monotonically increasing.
    • Model validated on held out sample
    • Takes preference into account at multiple levels if no data at lower levels
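A rough sketch of the pairwise transform and training described above, with a simple perceptron standing in for the linear-kernel ranking SVM; the feature values, CTRs, and learning rate are invented for illustration:

```python
def pairwise_training_set(docs):
    """docs: list of (feature_vector, ctr). Emit labeled difference
    vectors for every 'invertible' pair (different CTRs)."""
    data = []
    for xi, ci in docs:
        for xj, cj in docs:
            if ci > cj:  # xi should rank above xj
                diff = [a - b for a, b in zip(xi, xj)]
                data.append((diff, +1))                # correct order
                data.append(([-d for d in diff], -1))  # inverted pair
    return data

def train_linear(data, epochs=100, lr=0.1):
    """Tiny perceptron: find w so that w . diff > 0 for correctly
    ordered pairs (a stand-in for minimizing inverted pairs)."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

docs = [([0.9, 0.2], 0.30),   # high-CTR document
        ([0.4, 0.1], 0.10),
        ([0.1, 0.5], 0.05)]   # low-CTR document
w = train_linear(pairwise_training_set(docs))
scores = [sum(wi * xi for wi, xi in zip(w, x)) for x, _ in docs]
print(scores[0] > scores[1] > scores[2])  # True: ranking matches CTR order
```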
  • SVM and Returned Results
    • SVM learns preferred document in feature space irrespective of query
    • Initial ranking and thresholding by categorization ensure relevancy to query
    • SVM reranking ensures relevancy to user, original ranking is ignored
  • Performance of Original Experiment
    • Originally tested on an internal pool of attorneys, who were given rankings and asked to grade them
      • Returned 3 articles and 3 key numbers per query for grading
      • Possible grades:
        • A (good): very close to what was wanted
        • C (ok): close to what was wanted but not quite there
        • F (bad): not what was wanted at all
    • Method not using CTR (prior to ResultsPlus): 80% had one A, 4% had all Fs
    • Currently using CTR: >99% at least one A, <1% all Fs
      • There are now 300 different types of documents to suggest from a pool of over 3 million
    • Best if using both local and global data.
  • Revenue Increase
    • Re-ranking using all available signals made $2.5 million in extra revenue last year
    • Based on “out-of-plan” documents that must be paid for to view; other documents are free for the user
  • Learnings about Metadata
    • Even non-machine-readable metadata, such as handwritten notes, can be made useful and turned into machine-readable data, especially with a good conversion program
    • Any metadata is better than none; even things not assumed to help can help given enough data, since the learner has more to work with
      • Multiple sources are better than a single source
    • Strong transfer effects: the same model can be used on different (but similar) data types if the original training data was of high enough quality
    • One can build a pyramid of metadata, delivering more and more value to users
    • Metadata can be reused in new projects, enabling products and services that were not possible before
    • Business effect: lets business people better see the possibilities of metadata and how to leverage it
      • Metadata is usually domain-specific but lets you think of new ideas
  • Metaverse Predictions
    • Will not be based on RDF, webmasters, etc.
    • The metaverse will be better built by machines
      • People would take too long; we are impatient and want this functionality now
      • Intellectual property trade-off: sharing data is difficult to justify, which works against the semantic web
    • Current metaverse efforts think categorically, not stochastically; the data in these realms is too complex for a non-probabilistic system to work well, except as a starting point
    • Textual discovery is knowledge discovery, not encirclement
  • Comments/Discussion on talk
    • Most money on the internet comes from advertising, but in Thomson's case most money comes from users buying data
    • Same order of magnitude of documents as Amazon has products, but two orders of magnitude fewer users
      • Here clients may not always be happy about sharing data; for example, they may want knowledge to accumulate within a law case without sharing it with opposing attorneys on the same case
    • In ResultsPlus the ground truth is limited but available (grading); in most cases you need to give people incentives to provide it
    • Similar to tagging in terms of meta-data extraction
    • Most of the metadata generated is proprietary, but some open standards were also introduced (footnotes, page numbers, etc.)
    • Standard ways of referencing other documents. Programs exist that go through and analyze these links to other documents.
    • KeyCite: Citation network, more complex than simple linking: criticism, overruling, etc.
      • Cannot be used as the starting point for document recommendation, since the goal is to recommend non-directly-related documents. Citations were tried initially for ResultsPlus, but user data overpowered document metadata.
      • ResultsPlus dropped many features, evolving heavily toward CTR popularity; the former were powerful, but the latter was even more powerful.

Spock Overview

  • Spock is a person-oriented search engine, rather than a keyword oriented search engine
  • Result pages have attributes of people, rather than hyperlinks to external websites
    • These attributes are called meta-data
    • Attributes are determined by vote
    • This makes them probabilistic, avoiding Wikipedia edit-wars
  • Cold start problem: how do we start and extract tags automatically? Have robots vote.

People Search

  • Guest speaker: Hongche Liu, Chief Information Officer at Spock
  • Overview: an engineering-centric view of "people search" using metadata

Spock Community Ranking Presentation

• Ingredients

  • What is searchable? Names, Locations, Tags, Domains
  • What is filterable? Gender, Age, Presence of Image

• How do you design Spock's Architecture?

  • Key Challenges
    • How do you allow for server restarts?
      • Use a shared-memory approach
    • How do you ensure quick access to the primary database?
      • Build replicas of the database
    • How do you quickly update the database?
      • Needed because users want to see results of changes in real-time
  • Key Components

  • Profile Database
    • Holds the information about the names, tag, location, etc
    • Only accessed by the Web Server
  • Replica of Profile Database
    • Used because the ProfileDB is under a heavy load
  • Index Builder/Indices
    • Produces a set of indices
    • Every query is converted into an integer
    • Every string query is stored in memory for a fast search
      • This is why it's cheap to store the internet, but expensive to build a search engine
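A minimal sketch of interning query strings as integers for an in-memory index; the class and its details are an assumption based on the notes, not Spock's actual code:

```python
class QueryInterner:
    """Map each distinct query string to a small integer, so the
    in-memory indices can key on fixed-size ints instead of strings."""
    def __init__(self):
        self.ids = {}       # string -> int
        self.strings = []   # int -> string

    def intern(self, s):
        if s not in self.ids:
            self.ids[s] = len(self.strings)
            self.strings.append(s)
        return self.ids[s]

interner = QueryInterner()
print(interner.intern("new york"))  # 0
print(interner.intern("cyclist"))   # 1
print(interner.intern("new york"))  # 0 again: same string, same id
```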
  • SHM, SHM Creator
    • Shared memory (UNIX based)
    • This is what allows the server to restart
  • Search Aggregator:
    • Converts user-input into machine-readable format
    • Queries system for search-results, returns to the user

• How do we rank search results?

  • Features that can be ranked on
    • Term match - How much does "labrador dog owner" match "dog owner"?
    • Source authority - Where does the data come from?
      • MySpace data is time-dependent, less reliable
    • Community votes: Voting on tags is conducted by individuals, spock robot, etc.
    • Web Presence (term + name): What meaningful and relevant terms really apply to the individual?
      • If you call on your friends to tag you as the best basketball player, should you be ranked higher than Michael Jordan?
    • Profile popularity: How popular is the profile? (Popularity is independent of tags, search terms, etc.)
    • Profile completeness: How many pictures, tags, etc. are in the profile?
    • Attribute weighting: How should we weigh someone that lives in San Francisco vs. someone that worked in San Francisco for a short period of time?
    • Vote authority (anti-spam): By computing the predicted reliability of an individual's vote, we can determine the appropriate weight to assign to the individual's vote.
    • Click through Rate (CTR) feedback
  • How does it rank?
    • Gives each person a score based on the search term
    • Only uses statistical features of meta-information
    • Does not use semantic understanding
  • How do you determine the score?
    • The formula is below
    • [Slide 7: the scoring formula]
    • For each person, sum over all the matching tags
      • E.g., matching tags for the search term "American Cyclist" could be "American Cyclist," "American," and "Cyclist"
    • Each tag's contribution to the score is the (Sij + Pj)/f(Fi) term above
      • Sij and Pj are pre-computed before any search is done
      • f is a monotonically increasing function (e.g., log x); F is the frequency of the tag
        • Logic, if the tag is popular, matching it doesn't mean much
      • P scores serve as a tie-breaker
        • Useful for "cold-start" phase, when you have little information to go on
      • Indexer finds Sij and Pj (see below)
      • First, here is an example
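Under the assumptions above (f = log, similarity S and popularity P precomputed), the per-person score can be sketched as follows; all tag names, scores, and frequencies are made up:

```python
import math

def score(person_tags, query_tags, S, P, F):
    """Sum, over the person's tags that match the query, of
    (tag similarity S + popularity P) damped by tag frequency F."""
    total = 0.0
    for tag in person_tags & query_tags:
        total += (S[tag] + P) / math.log(F[tag] + math.e)  # f = log
    return total

# invented similarity scores and tag frequencies
S = {"american": 0.2, "cyclist": 0.6, "american cyclist": 1.0}
F = {"american": 50000, "cyclist": 800, "american cyclist": 40}
query = {"american", "cyclist", "american cyclist"}

# A profile with the rare exact phrase outranks one with only the
# very common tag, because f(F) damps popular tags:
a = score({"american cyclist", "cyclist"}, query, S, P=0.1, F=F)
b = score({"american"}, query, S, P=0.1, F=F)
print(a > b)  # True
```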

• Indexer

  • Decides which terms are similar to each other (finds Sij)
    • For example, it determines that "New York" and "New York City" mean the same thing
    • Uses the same formula as above
    • In this case "New York" and "New York City" have a large intersection, so they mean the same thing
  • Now you determine Sij, example below
  • [Slide 11: example of computing Sij]
    • Normalize to lower the spread
    • # of web documents found by searching for documents that have both "cyclist" and "Floyd Landis"
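A hedged sketch of computing Sij from web co-occurrence counts; the geometric-mean normalization is an assumption (the slide only says scores are normalized to lower the spread), and the counts are invented:

```python
import math

def tag_similarity(count_a, count_b, count_both):
    """S for two tags from web document counts: the overlap between
    pages containing each tag, normalized (here by the geometric
    mean, an assumed choice) to lower the spread."""
    if count_both == 0:
        return 0.0
    return count_both / math.sqrt(count_a * count_b)

# invented counts of web documents
cyclist = 2_000_000   # pages containing "cyclist"
landis = 300_000      # pages containing "Floyd Landis"
both = 250_000        # pages containing both terms
print(round(tag_similarity(cyclist, landis, both), 3))  # 0.323
```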

• Profile completeness and popularity

  • This is the p in the formula above
  • Features
    • Images: Are images available? How many images are there?
    • Activeness: Is this an active user or a dead profile?
    • Number of profiles related to you: If people link to your site, you must be more important.
    • Number of viewers: How many unique viewers?
    • Number of times you are a favorite: Your popularity increases when others consider you as a favorite.
    • Number of tags: You are more popular if you participate on multiple sites (LinkedIn, Wikis, etc.)
    • Number of web references
    • Number of views: A large number of searches can indicate a high level of popularity
  • All of these counts are weighted using a non-linear function
    • Having 1 count vs. 0 counts matters a great deal
    • Having 1000 counts vs. 500 counts doesn't mean much
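A log-style weighting illustrates the point: the first count is worth roughly as much as the entire jump from 500 to 1000 counts. The exact function is a guess; the notes only say it is non-linear:

```python
import math

def weight(count):
    """Diminishing-returns weight for a raw count (log1p is one
    plausible non-linear choice, not necessarily Spock's)."""
    return math.log1p(count)

print(weight(1) - weight(0))       # ~0.69: the first count matters a lot
print(weight(1000) - weight(500))  # ~0.69: 500 extra counts add no more
```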

• Voting Issues

What do you do when people put negative tags on profiles they do not like? What type of algorithm can you apply to solve this problem?

Vote Authority
• Spock predicts an individual's future vote validity by using Bayesian statistics - specifically, Bayes Law.
• That is, previous tags by an individual are observed and recorded. If a tag is then followed by negative votes, the voting authority of the individual who created the tag is decreased.
• Most people (95%) are good taggers.
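One common way to realize this idea (a Beta-Bernoulli sketch, not necessarily Spock's exact model): start from a prior reflecting that ~95% of taggers are good, then update it with community votes on the tagger's past tags:

```python
def vote_authority(confirmed, contradicted, prior_good=19, prior_bad=1):
    """Posterior mean of a tagger's reliability under a Beta(19, 1)
    prior (its mean 0.95 reflects that ~95% of taggers are good),
    updated with community votes on the tagger's past tags."""
    return (prior_good + confirmed) / (
        prior_good + prior_bad + confirmed + contradicted)

print(vote_authority(0, 0))    # 0.95: prior belief, no evidence yet
print(vote_authority(2, 18))   # 0.525: authority drops after downvotes
```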

• Testing

How do we know that the ranking is any good?

Quantities of Interest

  • Abandon rate: if people don't click on anything and immediately leave the site, the ranking must NOT be very good.
  • Average rank of a click-through: when users do click, which item in the presented results do they click on?
  • Average rank of positive and negative votes on the results page
    • Individuals can vote on the ranking on the results page.
    • If there is very little voting, the ranking is probably pretty good.
    • If many votes are cast, individuals do not agree with the ranking and are trying to change it.
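These quantities can be computed from simple click logs; a sketch with an invented log format, where each session is the list of result ranks the user clicked:

```python
def ranking_metrics(sessions):
    """sessions: one list of clicked result ranks per search
    (empty list = the user abandoned the results page).
    Returns (abandon_rate, average_rank_of_click)."""
    abandoned = sum(1 for s in sessions if not s)
    clicks = [rank for s in sessions for rank in s]
    avg_rank = sum(clicks) / len(clicks) if clicks else None
    return abandoned / len(sessions), avg_rank

logs = [[1], [2, 5], [], [1]]    # one of four sessions abandoned
rate, avg = ranking_metrics(logs)
print(rate)  # 0.25
print(avg)   # 2.25
```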

• Take Home Lessons

  • Google can be beat in certain subfields
    • Google doesn't look at metadata; the relationships between pieces of data can be important
      • E.g., "What lawsuits has Microsoft been involved in?"
    • Completeness is important
      • Lawyers need all the information available; with Google, coverage might be spotty or results might be buried
  • Search is an art, not a science
    • How do you know to use a non-linear function for ranking popularity? There is no objective function; you have to guess

Initial Contributors

  • Yaron Greif
  • Earl Wong
  • Trent Peterson
  • Marcin Mejran