Arabic topic identification based on empirical studies of topic models

Marwa Naili; Anja Chaibi; Henda Ghézala

doi:10.46298/arima.3102

arima:3102 - Revue Africaine de Recherche en Informatique et Mathématiques Appliquées, 3 August 2017, Volume 27 - 2017 - Special issue CARI 2016 - https://doi.org/10.46298/arima.3102

Authors: Marwa Naili ¹; Anja Chaibi ¹; Henda Ghézala ¹

1 Laboratoire de recherche en Génie Logiciel, Applications distribuées, Systèmes décisionnels et Imagerie intelligente [Manouba]

[en]
This paper focuses on topic identification for the Arabic language based on topic models. We study Latent Dirichlet Allocation (LDA) as an unsupervised method for Arabic topic identification. An in-depth study of LDA is carried out at two levels: the stemming process and the choice of LDA hyper-parameters. At the first level, we study the effect of different Arabic stemmers on LDA. At the second level, we focus on the LDA hyper-parameters α and β and their impact on topic identification. The study shows that LDA is an efficient method for Arabic topic identification, especially when the hyper-parameters are well chosen. Another important result is that the stemming algorithm has a strong impact on topic identification.
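To make the role of α and β concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, assuming symmetric priors: α smooths each document's topic mixture and β smooths each topic's word distribution. This is not the authors' implementation. The function name `lda_gibbs`, the toy corpus, and the parameter values are hypothetical, and the Latin tokens only stand in for Arabic words that a stemmer would already have reduced.

```python
import random


def lda_gibbs(docs, num_topics, alpha, beta, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling with symmetric priors alpha and beta.

    docs is a list of token lists (assumed already stemmed).
    Returns (theta, phi, vocab): per-document topic distributions,
    per-topic word distributions, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word v in topic k
    topic_total = [0] * num_topics                              # tokens in topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Posterior mean estimates from the final counts.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


# Hypothetical toy corpus: placeholder tokens standing in for stemmed Arabic words.
docs = [
    ["team", "match", "goal", "team", "player"],
    ["match", "player", "goal", "stadium"],
    ["bank", "market", "money", "bank", "price"],
    ["market", "price", "money", "trade"],
]

theta, phi, vocab = lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01)
for d, row in enumerate(theta):
    print(d, [round(p, 3) for p in row])
```

In this sketch, small values of α push each document toward a single dominant topic and small values of β push each topic toward a few characteristic words; rerunning with larger values shows how the estimated distributions flatten.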

Source: HAL:hal-01444574v2

Volume: Volume 27 - 2017 - Special issue CARI 2016

Published on: 3 August 2017

Accepted on: 3 July 2017

Submitted on: 2 August 2017

Keywords: ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing/I.2.7.6: Text analysis, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Topic identification, Topic models, Latent Dirichlet Allocation, LDA hyper-parameters α and β, Arabic stemmers

License: Attribution - NoDerivatives 4.0 International (CC BY-ND 4.0)
