# Data Mining Methodology

- Describe three challenges to data mining regarding data mining methodology and user interaction issues.

**Mining different kinds of knowledge in databases** - Different users may be interested in different kinds of knowledge. It is therefore necessary for data mining to cover a broad range of knowledge discovery tasks.

**Handling noisy or incomplete data** - Data cleaning methods are required to handle noise and incomplete objects during mining. Without them, the accuracy of the discovered patterns will be poor.

**Interactive mining of knowledge at multiple levels of abstraction** - The data mining process needs to be interactive so that users can focus the search for patterns, providing and refining data mining requests based on the returned results.

**Data mining challenges**

These challenges relate to data mining approaches and their limitations. The main issues are:

- Versatility of the mining approaches
- Diversity of data available

- Dimensionality of the domain

- Control and handling of noise in data

**User interaction issues**

The knowledge discovered using data mining tools is useful only if it is interesting to the user.

- Mining based on level of abstraction. The data mining process needs to be interactive so that the user can focus the search for patterns, submitting and refining data mining requests based on the returned results.
- Integration of background knowledge. Background knowledge can be used to guide the discovery process and to express the discovered patterns.

- Explain the requirements of clustering in data mining

- High dimensionality - the clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
- Ability to deal with noisy data - databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
- Interpretability - the clustering results should be interpretable, comprehensible and usable.
- Scalability - highly scalable clustering algorithms are needed to deal with large databases.
- Ability to deal with different kinds of attributes - clustering algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical and binary data.
- Discovery of clusters with arbitrary shape - the clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find only small, spherical clusters.
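The requirement to handle different kinds of attributes can be illustrated with a dissimilarity measure that treats each attribute according to its type. The following is a minimal sketch in the spirit of a Gower-style distance, not a prescribed method: the function name `mixed_distance`, the `kinds` and `ranges` parameters, and the sample records are all hypothetical choices made for illustration.

```python
def mixed_distance(a, b, kinds, ranges):
    """Average per-attribute dissimilarity between two records.

    kinds[i] is "numeric" or "categorical" (binary values are treated
    as categorical). ranges[i] is max - min for numeric attributes and
    is ignored for categorical ones. Both are assumptions of this sketch.
    """
    total = 0.0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if kind == "numeric":
            # Scale the difference to [0, 1] using the attribute's range.
            total += abs(x - y) / value_range if value_range else 0.0
        else:
            # Categorical and binary attributes: simple mismatch.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical records: (age, colour, is_member)
kinds = ("numeric", "categorical", "categorical")
ranges = (50, None, None)
d = mixed_distance((30, "red", True), (40, "blue", True), kinds, ranges)
```

A clustering algorithm that accepts an arbitrary dissimilarity function can then work on mixed data rather than being restricted to numerical attributes.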

- Discuss the following classification methods:
- Bayesian classification - It is based on Bayes' theorem. Bayesian classifiers predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

- Fuzzy set approach - Fuzzy set theory was proposed as an alternative to two-valued logic and probability theory. It allows us to work at a high level of abstraction, and it provides a means of dealing with imprecise measurements of data and with vague or inexact facts.

- Genetic algorithm - It provides a comprehensive search methodology for machine learning and optimization. Genetic algorithms are used in various areas of data mining to find optimized solutions that support decision making.


**Bayesian classification**

It uses Bayes' theorem to predict the occurrence of an event. Bayes' theorem describes how a degree of belief, expressed as a probability, is updated in light of new evidence.

The naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome.

Naive Bayes classifiers are used in document classification and spam filtering. They require only a small amount of training data to estimate the necessary parameters.
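The spam filtering use case can be sketched with a small naive Bayes classifier. This is a minimal illustration that assumes a multinomial word model with add-one (Laplace) smoothing; the function names `train` and `predict` and the four training messages are hypothetical and chosen only for the example. The classifier picks the class C that maximizes P(C) multiplied by the product of P(word | C) over the words in the message, which follows from Bayes' theorem together with the independence assumption.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (label, words) pairs. Returns the counts needed for prediction."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for label, words in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, words):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: log P(C)
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Log likelihood with add-one smoothing: log P(w | C)
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set
training = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["free", "money", "offer"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["project", "status", "meeting"]),
]
model = train(training)
label = predict(model, ["free", "money"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and smoothing keeps a single unseen word from driving a class probability to zero.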

**Fuzzy sets**

In conventional data mining, data sets are composed of variables with exact values and strict boundaries.

Fuzzy sets replace these strict boundaries with vague ones. A fuzzy set specifies the degree to which something belongs to it, and thus allows a gradual transition between membership and non-membership.

Fuzzy data sets arise when fuzziness is used in data selection and preparation, where intervals do not have exact endpoints. Such fuzzy data can be analyzed in fuzzy spaces.
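The idea of a gradual transition can be sketched with a membership function. The example below assumes a triangular shape, which is one common choice among many; the function names `triangular` and `warm` and the temperature breakpoints (15, 22 and 30 degrees Celsius) are hypothetical values picked for illustration.

```python
def triangular(x, a, b, c):
    """Membership degree in [0, 1] for a triangular fuzzy set.

    The degree is 0 at or below a, rises linearly to 1 at the peak b,
    and falls linearly back to 0 at or above c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def warm(temp_c):
    """Degree to which a temperature counts as 'warm' (assumed breakpoints)."""
    return triangular(temp_c, 15.0, 22.0, 30.0)


degrees = {t: warm(t) for t in (10.0, 18.5, 22.0, 26.0)}
```

A crisp set would classify 18.5 °C as either warm or not warm. Here it is warm to degree 0.5, which captures the vagueness of the boundary.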

**Genetic algorithm**

Genetic algorithms are adaptive, heuristic search algorithms that belong to the family of evolutionary algorithms. They are based on the ideas of natural selection and genetics.

They are used to generate high-quality solutions to optimization and search problems.

Genetic algorithms simulate survival of the fittest among the individuals of consecutive generations in order to solve a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution.
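The generational loop described above can be sketched as follows. This is a minimal illustration on a toy problem (maximizing the number of 1 bits in a bit string, often called OneMax), not a reference implementation. Tournament selection, single-point crossover, bit-flip mutation, keeping the best individual in each generation, and all parameter values are assumptions made for the example.

```python
import random


def fitness(individual):
    """Toy fitness: the number of 1 bits in the individual."""
    return sum(individual)


def select(population, rng, k=3):
    """Tournament selection: the fittest of k randomly chosen individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(individual, rng, rate=0.02):
    """Flip each bit independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in individual]


def evolve(length=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Carry the current best individual over unchanged (elitism).
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            child = crossover(select(population, rng),
                              select(population, rng), rng)
            children.append(mutate(child, rng))
        population = children
    return max(population, key=fitness)


best = evolve(seed=1)
```

Because the best individual is always copied into the next generation, the best fitness in the population never decreases from one generation to the next.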

Some applications of genetic algorithm are:

- Learning fuzzy rule base
- Code breaking
- Recurrent neural network
- Mutation testing