Patricia E.N. Lutu ; Andries P. Engelbrecht - A Comparative study of sample selection methods for classification

arima:1880 - Revue Africaine de Recherche en Informatique et Mathématiques Appliquées, September 2, 2007, Volume 6, april 2007, joint Special Issue ARIMA/SACJ on Advances in end-user data mining techniques - https://doi.org/10.46298/arima.1880
A Comparative study of sample selection methods for classification

Authors: Patricia E.N. Lutu 1; Andries P. Engelbrecht 1

  • 1 Department of Informatics [Pretoria]

Sampling of large datasets for data mining is important for at least two reasons. The processing of large amounts of data results in increased computational complexity. The cost of this additional complexity may not be justifiable. On the other hand, the use of small samples results in fast and efficient computation for data mining algorithms. Statistical methods for obtaining sufficient samples from datasets for classification problems are discussed in this paper. Results are presented for an empirical study based on the use of sequential random sampling and sample evaluation using univariate hypothesis testing and an information theoretic measure. Comparisons are made between theoretical and empirical estimates.
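The abstract names its techniques without giving the procedure, so the following is only an illustrative sketch and not the authors' method. It assumes a single categorical variable, grows a random sample in fixed steps, and stops once the Kullback-Leibler divergence between the sample and the full dataset falls below a threshold. The function names, the step size, the threshold, and the use of a Pearson chi-square statistic as the univariate test are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category in `values`."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi_square_statistic(sample, expected_probs, categories):
    """Pearson goodness-of-fit statistic of `sample` against `expected_probs`."""
    counts = Counter(sample)
    n = len(sample)
    return sum((counts[c] - n * e) ** 2 / (n * e)
               for c, e in zip(categories, expected_probs))


def sequential_sample(population, start=50, step=50, kl_threshold=0.01, seed=0):
    """Grow a random sample until its distribution is close to the population's.

    Hypothetical stopping rule: KL divergence at or below `kl_threshold`,
    or the sample has grown to the whole population.
    """
    rng = random.Random(seed)
    categories = sorted(set(population))
    full = distribution(population, categories)
    shuffled = population[:]
    rng.shuffle(shuffled)
    size = min(start, len(shuffled))
    while True:
        sample = shuffled[:size]
        # D(sample || population) is finite: every sampled category
        # also occurs in the population, so no q_i is zero.
        kl = kl_divergence(distribution(sample, categories), full)
        if kl <= kl_threshold or size == len(shuffled):
            return sample, kl
        size = min(size + step, len(shuffled))
```

A call such as `sequential_sample(["a"] * 700 + ["b"] * 300)` returns the accepted sample together with its divergence. The chi-square statistic can then be compared against a critical value chosen by the user for the relevant degrees of freedom.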


Volume: Volume 6, april 2007, joint Special Issue ARIMA/SACJ on Advances in end-user data mining techniques
Published on: September 2, 2007
Submitted on: February 16, 2007
Keywords: dataset sampling, data analysis, machine learning, classification, information measures, [INFO] Computer Science [cs], [MATH] Mathematics [math]
