Application of language neural network models for malware detection

Dudkin D.M., Kuznetsov M.A., Avdosev N.G., Shablovsky V.A., Egunov V.A.

Received: 10.05.2024

The growing popularity of large language models across scientific and industrial fields has led to solutions that apply these technologies to a wide range of tasks. This article proposes using the BERT, GPT, and GPT-2 language models to detect malicious code. A neural network model pretrained on natural-language text is fine-tuned on a preprocessed dataset of program files containing malicious and benign code. During preprocessing, program files in the form of machine instructions are translated into a textual description in a formalized language. The fine-tuned model is then used to classify software according to whether it contains malicious code. The article reports on an experiment with the proposed model, evaluates the quality of the approach in comparison with existing antivirus technologies, and suggests ways to improve the model's characteristics.

Keywords: antivirus, neural network, language models, malicious code, machine learning, model training, fine-tuning, BERT, GPT, GPT-2
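
The preprocessing step described in the abstract, translating machine instructions into a textual description, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `describe_byte` and `to_text` are hypothetical, the output format is not the authors' formalized language, and the tiny opcode table covers only a handful of single-byte 32-bit x86 instructions. A real pipeline would rely on a full disassembler, since decoding one byte at a time is incorrect for multi-byte instructions.

```python
# Toy illustration only, NOT the authors' preprocessing.
# Assumption: a small table of single-byte 32-bit x86 opcodes is enough
# to show the idea of turning machine code into text for a language model.

REGISTERS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def describe_byte(b: int) -> str:
    """Map one opcode byte to a mnemonic; unknown bytes become raw data."""
    if b == 0x90:
        return "nop"
    if b == 0xC3:
        return "ret"
    if b == 0xCC:
        return "int3"
    if 0x50 <= b <= 0x57:  # push r32
        return f"push {REGISTERS[b - 0x50]}"
    if 0x58 <= b <= 0x5F:  # pop r32
        return f"pop {REGISTERS[b - 0x58]}"
    return f"db 0x{b:02x}"  # fallback: emit the byte as data


def to_text(code: bytes) -> str:
    """Join per-byte descriptions into one line of text (hypothetical format)."""
    return " ; ".join(describe_byte(b) for b in code)


if __name__ == "__main__":
    sample = bytes([0x55, 0x90, 0x5D, 0xC3])
    print(to_text(sample))  # push ebp ; nop ; pop ebp ; ret
```

Text produced this way could then be passed to a tokenizer and a sequence classifier for fine-tuning; that stage is omitted here because the abstract does not specify its configuration.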