Introduction to KDD for Tony's MI Course

Introduction to KDD for Tony's MI Course

1 COMP3503 Intro to Inductive Modeling with Daniel L. Silver 2 Agenda Deductive and Inductive Modeling Learning Theory and Generalization Common Statistical Methods

3 The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing Data Consolidation Data Warehouse Consolidated Data

Data Sources p(x)=0.02 Patterns & Models Prepared Data 4 Deductive and Inductive Modeling Induction versus Deduction Top-down verification Deduction Model or General Rule

Example A Example B Example C Induction Bottom-up construction 5 6 Deductive Modeling Top-down (toward the data) verification of an hypothesis The hypothesis is generated within the mind of the data miner

Exploratory tools such as OLAP and data visualization software are used Models tend to be used for description 7 Inductive Modeling Bottom-up (from the data) development of an hypothesis The hypothesis is generated by the technology directly from the data Statistical and machine learning tools such as regression, decision trees and artificial neural networks are used Models can be used for prediction

8 Inductive Modeling Objective: Develop a general model or hypothesis from specific examples Function approximation (curve f(x) fitting) x Classification (concept learning, pattern recognition) A

x2 B x1 9 Learning Theory and Generalization Inductive Modeling = Learning Basic Framework for Inductive Learning Testing Examples Environment

Training Examples (x, f(x)) Inductive Learning System Induced Model or Hypothesis ~ f(x)? h(x) = A problem of representation and search for the best hypothesis, h(x). Output Classification

(x, h(x)) 10 Inductive Modeling = Data Mining Ideally, an hypothesis (model) is: Complete covers all potential examples Consistent no conflicts Accurate - able to generalize to previously unseen examples Valid presents a truth

Transparent human readable knowledge 11 12 Inductive Modeling Generalization The objective of learning is to achieve good generalization to new cases, otherwise just use a look-up table. Generalization can be defined as a mathematical interpolation or regression over a set of training points: f(x) x

13 Inductive Modeling Generalization Generalization accuracy can be guaranteed for a specified confidence level given sufficient number of examples Models can be validated for accuracy by using a previously unseen test set of examples 14 Learning Theory Probably Approximately Correct (PAC)

theory of learning (Leslie Valiant, 1984) Poses questions such as: How many examples are needed for good generalization? How long will it take to create a good model? Answers depend on: Complexity of the actual function The desired level of accuracy of the model (75%) The desired confidence in finding a model with this the accuracy (19 times out of 20 = 95%) 15 Learning Theory

- - c + + - Where c and h disagree - h +

- - - Space of all possible examples The true error of a hypothesis h is the probability that h will misclassify an instance drawn at random from X, error(h) = P[c(x) h(x)] 16 Learning Theory Three notions of error:

Training Error How often training set is misclassified Test Error How often an independent test set is misclassified True Error How often the entire population of possible examples would be misclassified Must be estimated from the Test Error 17 Linear and Non-Linear Problems Linear Problems

Linear functions Linearly separable classifications Non-linear f(x) x A x2 x1 Problemsf(x) Non-linear functions Not linearly separable classifications

B B A B x2 x1 18 Inductive Bias Every inductive modeling system has an Inductive Bias Consider a simple set of training

examples like the following: f(x) x Go to generalize.xls 19 Inductive Bias Can you think of any biases that you commonly use when you are learning something new? Is there one best inductive bias? 20

Inductive Modeling Methods Automated Exploration/Discovery Prediction/Classification

A e.g.. discovering new market segments x2 distance and probabilistic clustering algorithms x1 e.g.. forecasting gross sales given current factors statistics (regression, K-nearest neighbour) artificial neural networks, genetic algorithms f(x) Explanation/Description B

e.g.. characterizing customers by demographics inductive decision trees/rules rough sets, Bayesian belief nets if age > 35 and income < $35k then ... x 21 Common Statistical Methods 22 Linear Regression Y = b0 + b1 X1 + b2 X2 +... The coefficients b0, b1 determine a

line (or hyperplane for higher dim.) that fits the data Closed form solution via least squares (computes the smallest sum of squared distances between the examples and predicted values of Y) Inductive bias: The solution can be modeled by a straight line or hyperplane 23 Linear Regression Y = b0 + b1 X1 + b2 X2 +... A great way to start since it assumes you are modeling a simple function

Why? 24 Logistic Regression1 0 Y Z Y = 1/(1+e-Z) Where Z = b0 + b1 X1 + b2 X2 + Output is [0,1] and represents probability The coefficients b0, b1 determine an S-shaped non-linear curve that best fits data The coefficients are estimated using an iterative maximum-likelihood method Inductive bias: The solution can be modeled

by this S-shaped non-linear surface 25 Logistic Regression1 0 Y Y = 1/(1+e-Z) Where Z = b0 + b1 X1 + b2 X2 + Z Can be used for classification problems The output can be used as the probability

of being of the class (or positive) Alternatively, any value above a cut-off (typically 0.5) is classified as being a positive example A logistic regression Javascript page 26 THE END [email protected] 27 Learning Theory Example Space X(x,c(x)) x = input attributes c = true class function

(e.g. likes product) h = hypothesis (model) c - Where c and h disagree h + + - - The true error of a hypothesis h is the probability that

h will misclassify an instance drawn at random from X, err(h) = P[c(x) h(x)] 28 Generalization PAC - A Probabilistic Guarantee H = # possible hypotheses in modeling system = desired true error, where (0 < < 1) = desired confidence (1- ), where (0 < < 1) The the number of training examples required to select (with confidence ) a hypothesis h with err(h) < is given by 1 |H | m (ln )

Recently Viewed Presentations

  • ABECEDA RAUNALA PRIKAZ BROJEVA I ZNAKOVA U RAUNALU

    ABECEDA RAUNALA PRIKAZ BROJEVA I ZNAKOVA U RAUNALU

    (2) zapis sa stalnom točkom (ili fiksnim zarezom) Koliko je? 0,00000000000000000000000011 1100000000000000000000000 2,3E+14 2,3E-14 Tehnika kliznog ili pomičnog zareza 230000000000000(10)=2,3·1014 0,000000000000023 (10)=2,3·10-14 PRAKTIČNIJE 2,3 10 -14 mantisa baza eksponent Binarni brojevi i množenje s 2n i 2-n Binarni broj se...
  • Dosage Calculations - Wild Apricot

    Dosage Calculations - Wild Apricot

    We do not have any conflicts of interest or disclosures to make. No medications will be discussed off-label. Learning Objectives. Demonstrate the importance of calculations in pharmacy practice. ... Very severe, or endstage kidney failure (sometimes call ...
  • Glencoe Geometry - Ms. R&#x27;s Math Website

    Glencoe Geometry - Ms. R's Math Website

    If so, state the postulate or theorem that justifies your answer. Answer: Since 1 is not congruent to 4, line a is not parallel to line c by the Converse of the Alternate Interior Angles Theorem. 1 and 4 are...
  • ACM Information Systems SIG Leader Orientation Thursday, September

    ACM Information Systems SIG Leader Orientation Thursday, September

    ACM Information Systems SIG Leader Orientation Thursday, September 22, 2011 ACM Information Systems SIG Leader Orientation Thursday, September 22, 2011 ACM Information Systems 2 Penn Plaza, New York, NY 10121 [email protected] 212-626-0579 Technical Support Specialist Orlando Bovell [email protected] 212-626-0662 Web...
  • Data Science for High-Throughput Sequencing Lecture 1 Instructor:

    Data Science for High-Throughput Sequencing Lecture 1 Instructor:

    NET-Seq: L. Stirling Churchman and Jonathan S. Weissman, " ... Tn-seq; High-throughput Parallel Sequencing for Fitness and Genetic Interaction Studies in Microorganisms, ...
  • What is SEN?

    What is SEN?

    Thursday 7th June - Gareth D. Morewood - 'Autism (part 2)'. Please let Janet know what you would like Gareth to include in his talk. Cost: £3 national Patoss, £5 Shropshire Patoss, £7 guests. ...
  • Systems Basics: Roots of the Systems Movement

    Systems Basics: Roots of the Systems Movement

    Second Order Cybernetics Von Foerster, Cybernetics of Cybernetics, 1974 Maturana and Varela, Autopoiesis and Cognition, 1980 Bateson, Steps to an Ecology of Mind, 1972 Mind and Nature, 1979 "Cybernetics is the study of form and pattern" Observer Integral Part of...
  • Energy Primer OL - Figures

    Energy Primer OL - Figures

    Per Capita Useful Energy in 2005. Figure 2.7. Global Energy Flows 2005. Figure 2.8. Primary, Final, and Useful Energy in 2005. Figure 2.9. Growth in UK Energy Services(measured by final energy inputs) Figure 3.1. Drivers of UK Energy Service Demand...