EP 1346293 B1 20060628 - FINDING THE MOST INTERESTING PATTERNS IN A DATABASE QUICKLY BY USING SEQUENTIAL SAMPLING

Title (en)

FINDING THE MOST INTERESTING PATTERNS IN A DATABASE QUICKLY BY USING SEQUENTIAL SAMPLING

Title (de)

SCHNELLES AUFFINDEN DER INTERESSANTESTEN MUSTER IN EINER DATENBANK DURCH SEQUENTIELLE PROBENAHME

Title (fr)

PROCEDE PERMETTANT DE TROUVER RAPIDEMENT LES MOTIFS LES PLUS INTERESSANTS DANS UNE BASE DE DONNEES GRACE A UN ECHANTILLONNAGE SEQUENTIEL

Publication

EP 1346293 B1 20060628 (EN)

Application

EP 01960677 A 20010818

Priority

EP 01960677 A 20010818
EP 0109541 W 20010818
EP 00117900 A 20000819

Abstract (en)

[origin: US7136844B2] Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems, where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on the confidence and quality of solutions. Known sampling approaches have treated single-hypothesis selection problems, assuming that the utility is the average (over the examples) of some function, which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and the resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
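
The sequential idea described in the abstract can be illustrated with a minimal Python sketch. It is not the patented algorithm: it assumes the simplest case, in which each hypothesis's utility is the average over examples of a function with values in [0, 1], and it uses Hoeffding confidence intervals with a union bound over hypotheses and sampling steps. The names `hoeffding_radius`, `sequential_n_best`, `utilities`, `draw_example` and `max_samples` are hypothetical and chosen for illustration only.

```python
import math


def hoeffding_radius(m, delta):
    """Half-width of a two-sided Hoeffding interval for a mean of m values in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))


def sequential_n_best(utilities, draw_example, n, epsilon, delta, max_samples=100_000):
    """Sketch of sequential n-best selection (assumption: utilities are averages in [0, 1]).

    utilities    : dict mapping a hypothesis name to a function example -> value in [0, 1]
    draw_example : zero-argument function returning one randomly drawn example
    Returns (list of n hypothesis names, number of examples drawn).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    k = len(utilities)
    sums = {h: 0.0 for h in utilities}
    active = set(utilities)   # hypotheses still undecided
    accepted = []             # hypotheses already returned as "good"
    for m in range(1, max_samples + 1):
        x = draw_example()
        for h in active:
            sums[h] += utilities[h](x)
        means = {h: sums[h] / m for h in active}
        ranked = sorted(active, key=means.get, reverse=True)
        need = n - len(accepted)
        if len(active) <= need:
            return accepted + ranked, m
        # Union bound over k hypotheses and all steps m: the sum of 1/(m(m+1)) is 1.
        radius = hoeffding_radius(m, delta / (k * m * (m + 1)))
        if radius <= epsilon / 2.0:
            # Estimates are tight enough: the empirical top ones are epsilon-close to optimal.
            return accepted + ranked[:need], m
        nth_best = means[ranked[need - 1]]   # weakest of the current top `need`
        next_best = means[ranked[need]]      # strongest of the rest
        for h in ranked:
            if means[h] - radius > next_best + radius:
                accepted.append(h)           # confidently among the best: return early
                active.remove(h)
            elif means[h] + radius < nth_best - radius:
                active.remove(h)             # confidently not among the best: discard early
        if len(accepted) == n:
            return accepted, m
    # Sample budget exhausted: fall back to the empirical ranking of what is left.
    ranked = sorted(active, key=lambda h: sums[h], reverse=True)
    return accepted + ranked[: n - len(accepted)], max_samples
```

With clearly separated utilities, the early accept and discard steps end the loop well before the confidence radius shrinks to epsilon / 2, which is the sense in which a sequential sampler is usually faster than its worst-case sample bound.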

IPC 8 full level

G06F 17/30 (2006.01); G06F 17/18 (2006.01)

CPC (source: EP US)

G06F 17/18 (2013.01 - EP US); Y10S 707/99931 (2013.01 - US); Y10S 707/99934 (2013.01 - US); Y10S 707/99945 (2013.01 - US)

Citation (examination)

HANNU TOIVONEN: "Sampling Large Databases for Association Rules", PROCEEDINGS OF 22ND INT. CONF. ON VERY LARGE DATA BASES, 3 September 1996 (1996-09-03) - 6 September 1996 (1996-09-06), BOMBAY, pages 134 - 145

Designated contracting state (EPC)

AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

DOCDB simple family (publication)

WO 0217133 A2 20020228; WO 0217133 A3 20030626; AT E331990 T1 20060715; DE 60121214 D1 20060810; DE 60121214 T2 20070524; EP 1346293 A2 20030924; EP 1346293 B1 20060628; US 2005114284 A1 20050526; US 7136844 B2 20061114

DOCDB simple family (application)

EP 0109541 W 20010818; AT 01960677 T 20010818; DE 60121214 T 20010818; EP 01960677 A 20010818; US 34485203 A 20030407