Volume 6, April 2007, joint Special Issue ARIMA/SACJ on Advances in end-user data mining techniques

1. Progress of organisational data mining in South Africa

Mike Hart.

This paper describes three largely qualitative studies, spread over a five-year period, into the current practice of data mining in several large South African organisations. The objective was to gain an understanding, through in-depth interviews, of the major issues faced by participants in the data mining process. The focus is more on the organisational, resource and business issues than on technological or algorithmic aspects. The studies reveal strong progress over this period, and a model for the data mining organisation is proposed.

2. Motivic Pattern Extraction in Music, and Application to the Study of Tunisian Modal Music

Olivier Lartillo ; Mondher Ayari.

A new methodology for automated extraction of repeated patterns in time-series data is presented, aimed in particular at the analysis of musical sequences. The basic principle consists in a search for closed patterns in a multi-dimensional parametric space. It is shown that this basic mechanism needs to be articulated with a periodic pattern discovery system, which implies a strict chronological scanning of the time-series data. Thanks to this modelling, global pattern filtering may be avoided, and rich and highly pertinent results can be obtained. The modelling has been integrated in a collaborative project between ethnomusicology, cognitive sciences and computer science, aimed at the study of Tunisian Modal Music.

3. One-Class Classifiers: A Review and Analysis of Suitability in the Context of Mobile-Masquerader Detection

Oleksiy Mazhelis.

One-class classifiers, which are trained only on data from one class, are justified when data from other classes are difficult to obtain. In particular, their use is justified in mobile-masquerader detection, where user characteristics are classified as belonging to the legitimate user class or to the impostor class, and where collecting data originating from impostors is problematic. This paper systematically reviews various one-class classification methods, and analyses their suitability in the context of mobile-masquerader detection. For each classification method, its sensitivity to errors in the training set, its computational requirements, and other characteristics are considered. Then, for each category of features used in masquerader detection, suitable classifiers are identified.
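As a rough illustration of the one-class idea, the sketch below trains on samples from the legitimate user only and flags values that lie far from them. It is not one of the methods reviewed in the paper: the class name `ZScoreOneClass`, the single numeric feature, and the `max_z` threshold are all assumptions made for this example.

```python
from statistics import mean, stdev


class ZScoreOneClass:
    """Toy one-class classifier: accepts values close to the training mean.

    Hypothetical illustration only; trained on legitimate-user data alone.
    """

    def fit(self, samples):
        # Only data from the legitimate class is needed for training.
        self.mu = mean(samples)
        self.sigma = stdev(samples)
        return self

    def is_legitimate(self, x, max_z=3.0):
        # Accept x if it lies within max_z standard deviations of the mean.
        return abs(x - self.mu) <= max_z * self.sigma


# Assumed feature: some per-user numeric measurement (values are made up).
model = ZScoreOneClass().fit([10, 11, 9, 10, 12, 8, 10])
```

No impostor data appears anywhere in training, which is the property that makes such classifiers attractive when impostor data is hard to collect.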

4. A Texture-based Method for Document Segmentation and Classification

Ming-Wei Lin ; Jules-Raymond Tapamo ; Baird Ndovie.

In this paper we present a hybrid approach to segment and classify the contents of document images. A document image is segmented into three types of regions: Graphics, Text and Space. The image of a document is subdivided into blocks, and for each block five GLCM (Grey Level Co-occurrence Matrix) features are extracted. Based on these features, blocks are then clustered into three groups using the K-Means algorithm; connected blocks that belong to the same group are merged. The classification of groups is done using pre-learned heuristic rules. Experiments were conducted on scanned newspapers and images from the MediaTeam Document Database.
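The per-block feature extraction step can be sketched as follows. The abstract does not say which five GLCM features are used, so the four computed here (contrast, energy, homogeneity, entropy) are common choices assumed for illustration; the function names and the pixel offset `(dx, dy)` are likewise hypothetical.

```python
from collections import Counter
from math import log


def glcm(block, dx=1, dy=0):
    """Normalised grey-level co-occurrence matrix as a dict {(i, j): p}.

    block is a list of rows of integer grey levels; (dx, dy) is the
    assumed offset between the two pixels of each counted pair.
    """
    counts = Counter()
    rows, cols = len(block), len(block[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[(block[y][x], block[ny][nx])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def glcm_features(p):
    """Four commonly used texture features derived from a normalised GLCM."""
    return {
        "contrast": sum(v * (i - j) ** 2 for (i, j), v in p.items()),
        "energy": sum(v * v for v in p.values()),
        "homogeneity": sum(v / (1 + abs(i - j)) for (i, j), v in p.items()),
        "entropy": -sum(v * log(v) for v in p.values()),
    }


flat = glcm_features(glcm([[0, 0], [0, 0]]))      # uniform block
checker = glcm_features(glcm([[0, 1], [1, 0]]))   # alternating block
```

Feature vectors of this kind, one per block, would then be the input to the clustering step; a uniform block gives zero contrast, while an alternating block gives a higher contrast and lower energy.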

5. New Evolutionary Classifier Based on Genetic Algorithms and Neural Networks: Application to the Bankruptcy Forecasting Problem

M.A. Esseghir.

Artificial neural networks (ANNs) have been widely applied in data mining as a supervised classification technique. The accuracy of this model stems mainly from its high tolerance to noisy data and its ability to classify patterns on which it has not been trained. Moreover, the performance of ANN-based models depends both on the ANN parameters and on the quality of the input variables. However, an exhaustive search for either appropriate parameters or predictive inputs is computationally very expensive. In this paper, we propose a new hybrid model based on genetic algorithms and artificial neural networks. Our evolutionary classifier is capable of selecting the best set of predictive variables, then searching for the best neural network classifier and improving classification and generalization accuracies. The proposed model was applied to the problem of bankruptcy forecasting; experiments have shown very promising results for bankruptcy prediction in terms of predictive accuracy and adaptability.

6. A Comparative study of sample selection methods for classification

Patricia E.N. Lutu ; Andries P. Engelbrecht.

Sampling of large datasets for data mining is important for at least two reasons. The processing of large amounts of data results in increased computational complexity. The cost of this additional complexity may not be justifiable. On the other hand, the use of small samples results in fast and efficient computation for data mining algorithms. Statistical methods for obtaining sufficient samples from datasets for classification problems are discussed in this paper. Results are presented for an empirical study based on the use of sequential random sampling and sample evaluation using univariate hypothesis testing and an information theoretic measure. Comparisons are made between theoretical and empirical estimates.
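One way to picture sequential random sampling with an information-theoretic stopping rule is sketched below. This is an assumption-laden illustration, not the procedure evaluated in the paper: the sample grows in fixed steps until the Kullback-Leibler divergence between the class distribution of the sample and that of the full dataset falls below a threshold. The names `sequential_sample`, `step`, and `threshold` are hypothetical.

```python
import random
from collections import Counter
from math import log


def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    n = len(values)
    return {k: c / n for k, c in counts.items()}


def kl_divergence(p, q):
    """KL(p || q); q must assign non-zero mass to every category of p."""
    return sum(pv * log(pv / q[k]) for k, pv in p.items() if pv > 0)


def sequential_sample(values, step=50, threshold=0.01, seed=0):
    """Grow a random sample until its distribution is close to the full one."""
    rng = random.Random(seed)
    shuffled = values[:]
    rng.shuffle(shuffled)
    target = distribution(values)
    size = min(step, len(shuffled))
    while True:
        sample = shuffled[:size]
        close = kl_divergence(distribution(sample), target) <= threshold
        if close or size == len(shuffled):
            return sample
        size = min(size + step, len(shuffled))


labels = ["a"] * 700 + ["b"] * 300   # made-up class labels
sample = sequential_sample(labels)
```

Because the sample is always drawn from the dataset itself, every category in the sample also occurs in the target distribution, so the divergence is well defined.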

7. A Word Game Support Tool Case Study

T. Botha ; D.G. Kourie ; B.W. Watson.

This article reports on the approach taken, experience gathered, and results found in building a tool to support the derivation of solutions to a particular kind of word game. This required deriving techniques for simple yet acceptably quick access to a dictionary of natural language words (in the present case, Afrikaans). The main challenge was to access a large corpus of natural language words via a partial match retrieval technique. Other challenges included discovering how to represent such a dictionary in a "semi-compressed" format, thus arriving at a balance that favours search speed but nevertheless yields savings on storage requirements. In addition, a query language had to be developed that would effectively exploit this access method. The system is designed to support a more intelligent query capability in the future. Acceptable response times were achieved even though an interpretive scripting language, ObjectREXX, was used.
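Partial match retrieval can be illustrated with a small positional index: words are grouped by length and by the letter found at each position, and a query with wildcards intersects the matching groups. This is a minimal sketch and not the article's data structure or query language; the class `PartialMatchIndex`, the `?` wildcard, and the tiny word list are assumptions.

```python
from collections import defaultdict


class PartialMatchIndex:
    """Hypothetical index answering patterns such as 'h?nd'."""

    def __init__(self, words):
        self.by_length = defaultdict(set)
        self.by_position = defaultdict(set)
        for w in words:
            self.by_length[len(w)].add(w)
            for i, ch in enumerate(w):
                # Key: (word length, position, letter at that position).
                self.by_position[(len(w), i, ch)].add(w)

    def query(self, pattern, wildcard="?"):
        # Start from all words of the right length, then narrow down
        # using every position where the pattern fixes a letter.
        candidates = set(self.by_length.get(len(pattern), set()))
        for i, ch in enumerate(pattern):
            if ch != wildcard:
                candidates &= self.by_position.get((len(pattern), i, ch), set())
        return sorted(candidates)


index = PartialMatchIndex(["kat", "kar", "hond", "hand", "boom"])
matches = index.query("h?nd")   # ['hand', 'hond']
```

An index like this trades extra storage for lookup speed, the opposite end of the balance from a compressed dictionary representation.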