Data mining

From Wikipedia, the free encyclopedia


Data mining is the practice of sorting through large amounts of data and picking out relevant information. It is most often used by business intelligence organizations and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"[1] and "the science of extracting useful information from large data sets or databases".[2]


Although the term "data mining" is usually used in relation to analysis of data, like artificial intelligence, it is an umbrella term with varied meanings in a wide range of contexts.

Data mining is considered a subfield of knowledge discovery, itself a field of computer science. It is also closely related to applied statistics and its subfields, descriptive statistics and inferential statistics.

The term "data mining" is often used incorrectly to apply to a variety of unrelated processes. In many cases, applications may claim to perform "data mining" by automating the creation of charts or graphs with historic trends and analysis. Although this information may be useful and timesaving, it does not fit the traditional definition of data mining, as the application performs no analysis itself and has no understanding of the underlying data. Instead, it relies on templates or pre-defined macros (created either by programmers or users) to identify trends, patterns and differences.

A key defining factor for true data mining is that the application itself performs some real analysis. In almost all cases, this analysis is guided by some degree of user interaction, but it must give the user insights that are not readily apparent through simple slicing and dicing. Applications that are not to some degree self-guiding are performing data analysis, not data mining.

Traditionally, analysts have performed the task of extracting useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches. As data sets have grown in size and complexity, there has been an inevitable shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. The modern technologies of computers, networks, and sensors have made data collection and organization an almost effortless task. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, to data.[3]

Data mining identifies trends within data that go beyond simple analysis. Through the use of sophisticated algorithms, users have the ability to identify key attributes of business processes and target opportunities.

Although data mining is a relatively new term, the technology is not. Companies have long used powerful computers to sift through volumes of data, such as supermarket scanner data, to produce market research reports. Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of analysis.

The term data mining is often applied to two separate processes: knowledge discovery and prediction. Knowledge discovery provides explicit information that has a readable form and can be understood by a user. Forecasting, or predictive modeling, provides predictions of future events; it may be transparent and readable in some approaches (e.g. rule-based systems) and opaque in others, such as neural networks. Moreover, some data mining systems, such as neural networks, are inherently geared towards prediction and pattern recognition rather than knowledge discovery.

Metadata, or data about a given data set, are often expressed in a condensed data mine-able format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.

'Data dredging' and 'data fishing' are terms used to criticize data mining efforts when the patterns or causal relationships discovered are felt to be unfounded. In this case the pattern suffers from overfitting to the training data.

Data dredging is the practice of scanning the data for any relationships and then, when one is found, coming up with an interesting explanation for it. The conclusions may be suspect because data sets with large numbers of variables will contain some "interesting" relationships purely by chance. Fred Schwed [4] said:

"There have always been a considerable number of people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it."

Nevertheless, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.

Some exploratory data work is always required in any applied statistical analysis to get a feel for the data, so sometimes the line between good statistical practice and data dredging is less than clear.

Most data mining efforts are focused on developing highly detailed models of some large data set. Other researchers have described an alternate method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data. [5]

When a data set contains a large number of variables, the significance threshold should be tightened in proportion to the number of patterns tested. For example, if we test 100 random patterns at the 0.01 significance level, we expect about one of them to appear "interesting" purely by chance.
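The arithmetic behind the 100-pattern example can be sketched as follows. This is a minimal illustration that assumes the tests are independent and that no real pattern exists; the function names are hypothetical, and the Bonferroni correction shown is only one common way of tightening the threshold, not one the sources cited here prescribe.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of patterns that pass the test purely by chance."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n independent chance patterns passes."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, family_alpha: float) -> float:
    """Per-test threshold that keeps the overall false-positive rate
    at or below family_alpha (Bonferroni correction)."""
    return family_alpha / n_tests


# The example from the text: 100 random patterns tested at the 0.01 level.
print(expected_false_positives(100, 0.01))
print(prob_at_least_one_false_positive(100, 0.01))
print(bonferroni_threshold(100, 0.01))
```

Under these assumptions, the chance of seeing at least one spurious "interesting" pattern among the 100 is roughly 63%, which is why an uncorrected threshold is a poor guard against data dredging.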

Cross validation is a common approach to evaluating the fitness of a model generated via data mining, where the data are divided into a training subset and a test subset to respectively build and then test the model. Common cross validation techniques include the holdout method, k-fold cross validation, and the leave-one-out method.
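As a rough sketch of how k-fold cross validation divides the data, the following generator yields the training and test indices for each fold. The function name is hypothetical, and the sketch assumes the records have already been shuffled; setting k equal to the number of samples gives the leave-one-out method.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each record appears in exactly one test subset across the k folds.
for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

A model would be built on each training subset and scored on the matching test subset, and the k scores would then be averaged.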

There are also privacy and human rights concerns associated with data mining, specifically regarding the source of the data analyzed.

Data mining government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns. [6] [7]

There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs exhibiting harmful interactions. Since any particular combination may occur in only 1 out of 1000 people, a great deal of data would need to be examined to discover such an interaction. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.

Essentially, data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.[8]

Since the early 1960s, oracles for certain combinatorial games, also called tablebases, have become available: for example, for 3x3 chess from any beginning configuration, for small-board dots-and-boxes and small-board hex, and for certain endgames in chess, dots-and-boxes, and hex. This has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not yet seem to reach the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art (that is, pre-tablebase knowledge), is used to yield insightful patterns. Berlekamp in dots-and-boxes and other games, and John Nunn in chess endgames, are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

Data mining is most frequently used for customer relationship management (CRM) applications. Common goals are to predict which people are most likely to: a) be acquired; b) be cross-sold or up-sold; c) leave (churn); d) be retained, saved, or won back.

These applications can contribute significantly to the bottom line. Rather than contacting every prospect or customer through a call center or by mail, a company contacts only those prospects predicted to have a high likelihood of responding to an offer.

More sophisticated methods may be used to optimize across campaigns, predicting which channel and which offer, out of all potential offers, an individual is most likely to respond to.

Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer.
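The idea behind uplift can be illustrated with a simple comparison of response rates between people who received an offer and people who did not. The sketch below is not a full uplift model: it assumes customers fall into predefined segments and that a control group without the offer exists for each. The function name, segment labels, and records are all hypothetical.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Return {segment: response rate with offer - response rate without}.

    records is an iterable of (segment, got_offer, responded) tuples.
    """
    # counts[segment][got_offer] = [number of responders, group size]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, got_offer, responded in records:
        cell = counts[segment][bool(got_offer)]
        cell[0] += int(responded)
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        (t_resp, t_total), (c_resp, c_total) = groups[True], groups[False]
        if t_total == 0 or c_total == 0:
            continue  # no comparison possible without both groups
        uplift[segment] = t_resp / t_total - c_resp / c_total
    return uplift


# Made-up records: "loyal" customers respond at the same rate with or
# without the offer, while "lapsed" customers respond only when given one.
records = (
    [("loyal", True, True)] * 4 + [("loyal", True, False)]
    + [("loyal", False, True)] * 4 + [("loyal", False, False)]
    + [("lapsed", True, True)] * 2 + [("lapsed", True, False)] * 2
    + [("lapsed", False, False)] * 4
)
print(uplift_by_segment(records))
```

In this made-up data the "loyal" segment has the higher raw response rate but zero uplift, so an offer sent to it changes nothing, while the "lapsed" segment shows the greater increase.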

Businesses employing data mining may quickly see a return on investment, but they also find that the number of predictive models can quickly become very large. Rather than one model to predict which customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people who are likely to churn, it may want to send offers only to customers who are likely to take the offer. Finally, it may also want to determine which customers are going to be profitable over a window of time and send offers only to those. To maintain this quantity of models, businesses need to 1) manage model versions and 2) move to "automated data mining".


Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favour silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction based, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products that have a specific defect or problem will develop a secondary problem within the next 6 months.
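An association rule of this kind is usually described by its support (how often the items occur together) and its confidence (how often the rule holds when its condition is met); the 73% figure in the manufacturing example plays the role of a confidence. The following minimal sketch computes both over a handful of made-up baskets. The function names and the transactions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [b for b in transactions if antecedent <= set(b)]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent <= set(b))
    return both / len(with_antecedent)


# Made-up purchase records for a clothing store.
baskets = [
    {"silk shirt", "tie"},
    {"silk shirt", "tie", "belt"},
    {"cotton shirt"},
    {"silk shirt"},
    {"cotton shirt", "belt"},
]
print(support(baskets, {"silk shirt", "tie"}))
print(confidence(baskets, {"silk shirt"}, {"tie"}))
```

Here two of the five baskets contain both a silk shirt and a tie, and two of the three silk shirt baskets also contain a tie, so the rule "silk shirt implies tie" has a support of 2/5 and a confidence of 2/3.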


  1. ^ W. Frawley and G. Piatetsky-Shapiro and C. Matheus (Fall 1992). "Knowledge Discovery in Databases: An Overview". AI Magazine: pp. 213-228. ISSN 0738-4602. 
  2. ^ D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge, MA. ISBN 0-262-08290-X. 
  3. ^ Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0471228524. 
  4. ^ Fred Schwed, Jr (1940). Where Are the Customers' Yachts?. ISBN 0-471-11979-2.
  5. ^ T. Menzies, Y. Hu (November 2003). "Data Mining For Very Busy People". IEEE Computer: pp. 18-25. ISSN 0018-9162.
  6. ^ K.A. Taipale (December 15, 2003). "Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data". Colum. Sci. & Tech. L. Rev. 5 (2). SSRN 546782 / OCLC 45263753.
  7. ^ John Resig, Ankur Teredesai (2004). "A Framework for Mining Instant Messaging Services". In Proceedings of the 2004 SIAM DM Conference.
  8. ^ Chip Pitts (March 15, 2007). "The End of Illegal Domestic Spying? Don't Count on It". Wash. Spec.
  9. ^ Stephen Haag et al.. Management Information Systems for the information age, pp 28. ISBN 0-07-095569-7. 

  • Kurt Thearling, An Introduction to Data Mining (also available is a corresponding online tutorial)
  • Dean Abbott, I. Philip Matkovsky, and John Elder IV, Ph.D. An Evaluation of High-end Data Mining Tools for Fraud Detection published a comparative analysis of major high-end data mining software tools that was presented at the 1998 IEEE International Conference on Systems, Man, and Cybernetics, San Diego, CA, October 12-14, 1998.
  • Mierswa, Ingo and Wurst, Michael and Klinkenberg, Ralf and Scholz, Martin and Euler, Timm: YALE: Rapid Prototyping for Complex Data Mining Tasks, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.
  • Peng, Y., Kou, G., Shi, Y. and Chen, Z. "A Systemic Framework for the Field of Data Mining and Knowledge Discovery", in Proceeding of workshops on The Sixth IEEE International Conference on Data Mining (ICDM), 2006
  • Hari Mailvaganam and Daniel Chen, Articles on Data Mining

  • Peter Cabena, Pablo Hadjnian, Rolf Stadler, Jaap Verhees, Alessandro Zanasi, Discovering Data Mining: From Concept to Implementation (1997), Prentice Hall, ISBN 0137439806
  • Ronen Feldman and James Sanger, The Text Mining Handbook, Cambridge University Press, ISBN 9780521836579
  • Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining (2005), ISBN 0-321-32136-7 (companion book site)
  • Galit Shmueli, Nitin R. Patel and Peter C. Bruce , Data Mining for Business Intelligence (2006), ISBN 0-470-08485-5 (companion book site)
  • Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, Wiley Interscience, ISBN 0-471-05669-3, (see also Powerpoint slides)
  • Phiroz Bhagat, Pattern Recognition in Industry, Elsevier, ISBN 0-08-044538-1
  • Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (2000), ISBN 1-55860-552-5, (see also Free Weka software)
  • Mark F. Hornick, Erik Marcade, Sunil Venkayala: "Java Data Mining: Strategy, Standard, And Practice: A Practical Guide for Architecture, Design, And Implementation" (paperback)
  • Weiss and Indurkhya, Predictive Data Mining, Morgan Kaufman
  • Yike Guo and Robert Grossman, editors: High Performance Data Mining: Scaling Algorithms, Applications and Systems, Kluwer Academic Publishers, 1999
  • Trevor Hastie, Robert Tibshirani and Jerome Friedman (2001). The Elements of Statistical Learning, Springer. ISBN 0387952845 (companion book site)
