Theme Identification using Machine Learning Techniques

Siti Hajar Jayady, Hasmawati Antong

DOI: https://doi.org/10.51662/jiae.v1i2.24

Abstract


With the abundance of online research platforms, information published as PDF files, such as journal articles, is easy to obtain. Students working on research projects therefore tend to accumulate many downloaded PDF articles on their laptops. Finding the relevant articles manually within such a collection is tiring because most articles run to several pages that must be read. Reading each article to decide whether it relates to a given theme, and then organizing the articles by theme, consumes time and energy. To address this problem, a PDF file organizer that implements a theme identifier is needed. This work therefore focuses on automatic text classification using machine learning methods to build a theme identifier, used in the PDF file organizer, that classifies articles into two themes: augmented reality and machine learning. A total of 1000 text documents covering both themes were used to build the classification model. A pre-processing step was performed for data cleaning, followed by TF-IDF feature extraction to vectorize the text and reduce vector sparsity. 80% of the dataset was used for training, and the remaining 20% was used to validate the trained models. The classification models proposed in this work are Linear SVM and Multinomial Naïve Bayes. The accuracy of the models was evaluated using a confusion matrix. For the Linear SVM model, grid-search optimization was performed to determine the optimal value of the cost parameter.
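The pipeline described in the abstract (cleaning, TF-IDF vectorization, an 80/20 split, a Multinomial Naïve Bayes classifier, and a confusion matrix) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the toy corpus, the whitespace tokenizer, the smoothed IDF formula, the unshuffled per-theme split, and all function names are assumptions made for illustration. The Linear SVM model and its grid search over the cost parameter are not covered here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (text, theme) pairs; it stands in for the
# 1000 cleaned documents, which are not reproduced here.
DOCS = [
    ("augmented reality overlays virtual objects on a camera view", "AR"),
    ("a headset renders virtual objects for augmented reality", "AR"),
    ("marker tracking aligns virtual objects with the camera view", "AR"),
    ("augmented reality headset tracking uses camera markers", "AR"),
    ("virtual overlays in augmented reality need camera tracking", "AR"),
    ("machine learning models are trained on labelled data", "ML"),
    ("a classifier is trained with gradient descent on data", "ML"),
    ("neural network training minimises a loss on labelled data", "ML"),
    ("machine learning classifier accuracy depends on training data", "ML"),
    ("gradient descent trains neural network models", "ML"),
]

def tokenize(text):
    """Assumed minimal cleaning: lowercase and split on whitespace."""
    return text.lower().split()

def split_80_20(docs):
    """Per-theme 80/20 split that keeps the original order (no shuffling)."""
    by_label = defaultdict(list)
    for text, label in docs:
        by_label[label].append((text, label))
    train, test = [], []
    for items in by_label.values():
        cut = len(items) * 8 // 10
        train += items[:cut]
        test += items[cut:]
    return train, test

def fit_idf(token_lists):
    """Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; out-of-vocabulary terms are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def train_nb(vectors, labels, vocab_size, alpha=1.0):
    """Multinomial Naive Bayes on TF-IDF weights with Laplace smoothing."""
    weights = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, w in vec.items():
            weights[label][term] += w
    model = {}
    for label, w in weights.items():
        model[label] = {
            "log_prior": math.log(labels.count(label) / len(labels)),
            "weights": dict(w),
            "denominator": sum(w.values()) + alpha * vocab_size,
            "alpha": alpha,
        }
    return model

def predict(model, vec):
    """Return the theme with the highest log-posterior score."""
    def score(label):
        m = model[label]
        return m["log_prior"] + sum(
            x * math.log((m["weights"].get(t, 0.0) + m["alpha"]) / m["denominator"])
            for t, x in vec.items()
        )
    return max(model, key=score)

train, test = split_80_20(DOCS)
train_tokens = [tokenize(text) for text, _ in train]
idf = fit_idf(train_tokens)  # IDF is fitted on the training split only
model = train_nb([tfidf(tok, idf) for tok in train_tokens],
                 [label for _, label in train], len(idf))
predictions = [predict(model, tfidf(tokenize(text), idf)) for text, _ in test]

# Confusion matrix as counts keyed by (true theme, predicted theme).
confusion = Counter((label, pred) for (_, label), pred in zip(test, predictions))
accuracy = sum(n for (y, p), n in confusion.items() if y == p) / len(test)
print(dict(confusion), accuracy)
```

Fitting the IDF weights on the training split only keeps the held-out documents from influencing the features, which is why the split happens before vectorization in this sketch.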


Keywords


Multinomial Naïve Bayes; Portable Document File (PDF); Pre-processing; Term Frequency-Inverse Document Frequency (TF-IDF)

Journal of Integrated and Advanced Engineering (JIAE),
Published by:
Asosiasi Staf Akademik Perguruan Tinggi Seluruh Indonesia (ASASI): http://asasi.id/

p-ISSN: 2774-602X
e-ISSN: 2774-6038
Journal URL: https://asasijournal.id/index.php/jiae/
Journal DOI: 10.51662/jiae

© 2022 Copyright by ASASI
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.