A Novel Slang and Formal Text Classification with Data Exploration and Optimized Deep Learning Models

Hoger K. Omar

doi:10.12928/biste.v8i3.14373

Authors

Hoger K. Omar, University of Kirkuk


Keywords:

Exploratory Data Analysis, Hyperparameter Algorithms, Text Categorization, Slang Classification, Deep Learning

Abstract

Automated text classification applies artificial intelligence algorithms to assign text documents to predefined categories. Building a high-accuracy text categorization model is therefore an important task, especially for unstructured narratives such as research papers, medical documents, and news articles. This study examines the use of an artificial neural network (ANN) to classify English text as formal or slang, implemented with the popular deep learning frameworks TensorFlow and Keras. First, the dataset's features were examined through exploratory data analysis (EDA) to better understand the data. Second, several preprocessing techniques were applied to address the challenges posed by informal writing style. In addition, adding a list of common English abbreviations greatly improved the accuracy and effectiveness of classification. Finally, multiple hyperparameter optimization approaches were applied for further improvement. The proposed techniques mitigated the impact of heterogeneous and noisy data in both formal and informal language, improving overall classification accuracy by approximately 10%. The study contributes to the field of text mining and offers practical guidance for optimizing deep learning models for English text categorization.
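As an illustration of the abbreviation-list step mentioned in the abstract, the following is a minimal sketch in Python. The abstract does not say how the list was applied, so this sketch assumes one plausible use: expanding known abbreviations and counting how many were found, so that the count could serve as an extra feature for a classifier. The table `ABBREVIATIONS`, its entries, and the function `abbreviation_features` are hypothetical names chosen for this example and are not taken from the study.

```python
import re

# Hypothetical abbreviation table (assumption): the study's actual list
# is not given in the abstract.
ABBREVIATIONS = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "btw": "by the way",
    "idk": "i do not know",
}

# Tokens are runs of lowercase letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def abbreviation_features(text, table=ABBREVIATIONS):
    """Return (expanded_text, hits) for a piece of text.

    expanded_text: lowercased tokens with known abbreviations expanded.
    hits: number of tokens that matched the abbreviation table.
    """
    tokens = TOKEN_RE.findall(text.lower())
    hits = sum(1 for tok in tokens if tok in table)
    expanded = " ".join(table.get(tok, tok) for tok in tokens)
    return expanded, hits


if __name__ == "__main__":
    print(abbreviation_features("BTW u r gr8!"))
    print(abbreviation_features("The results are significant."))
```

Matching is done on whole tokens, so the letter "r" inside a word such as "results" is left untouched. Whether the expanded text, the hit count, or both are passed to the model is a design choice that this sketch leaves open.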
