16 January 2025
AI for malware detection
Articles
Today, more than ever, cybersecurity has become a strategic issue for businesses, as the malware threat continues to grow and evolve: from ransomware that cripples infrastructure, to stealers that exfiltrate confidential data, to stealthy spyware that infiltrates systems.
For decades, signature-based detection has been an effective and widely used method for identifying these threats. Today, however, this approach is showing its limitations in the face of increasingly sophisticated malware.
In this article, we explore how artificial intelligence, with its advanced algorithms and ability to analyze complex contexts, is revolutionizing the fight against malware, providing solutions that are equal to today’s challenges.
Signature-based detection is outdated
At the end of the 1980s, the first antivirus programs used a detection method known as "fingerprinting" or digital "signature" matching. This technique consists of analyzing each file with a hash algorithm, which computes a unique result called a digital fingerprint. By comparing this fingerprint against a massive database of known threats, the antivirus can then determine whether the file is known to be malicious, and handle it accordingly.
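The lookup described above can be sketched in a few lines. This is an illustrative toy, not a real antivirus engine; the blacklist here contains only the well-known SHA-256 of the EICAR antivirus test file.

```python
import hashlib

# Toy blacklist of known-malicious fingerprints.
# The single entry is the published SHA-256 of the EICAR test file.
KNOWN_MALICIOUS = {
    "275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f",
}

def fingerprint(data: bytes) -> str:
    """Compute the file's digital fingerprint (here, a SHA-256 digest)."""
    return hashlib.sha256(data).hexdigest()

def is_known_malicious(data: bytes) -> bool:
    """Signature-based detection: an exact match against the blacklist."""
    return fingerprint(data) in KNOWN_MALICIOUS
```

Note that changing a single byte of a file yields a completely different fingerprint, which is exactly the weakness polymorphic malware exploits.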
To circumvent this detection method, cybercriminals quickly developed malware capable of changing its own digital fingerprint: "polymorphic" malware in the simplest cases, "metamorphic" malware in the most complex.
In response, cybersecurity experts evolved signature-based detection with new hashing algorithms that are more robust to changes in the file, such as SSDEEP (2006), and with more expressive signatures such as YARA rules (2007), which enable the definition and sharing of complex, specific signatures (based in particular on the parts of the file that remain unchanged across iterations of a polymorphic sample).
Although still widely used because of their speed, these techniques remain limited and labor-intensive: new threats stay invisible until an expert has analyzed them, and the number of false positives they generate is often poorly controlled.
So what can we do?
Adding context
Today’s next-generation antivirus (NGAV) programs incorporate file behavior analysis to detect malicious actions, whether on the user’s workstation (file creation/deletion, access to protected resources, etc.) or on the network (communication with a remote command-and-control server, data exfiltration, etc.).
By taking multiple factors into account (in other words, some context), today’s antivirus programs are able to provide more accurate detection. For example, a document attached to an email deserves different scrutiny depending on whether or not the sender is in our address book, and a mobile application deserves closer attention if the permissions it requests do not match its description.
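One way to picture this combination of contextual factors is a weighted risk score. The signal names and weights below are entirely hypothetical, chosen to mirror the examples in this article; a real NGAV engine uses far richer models.

```python
# Hypothetical risk signals and weights, for illustration only.
CONTEXT_WEIGHTS = {
    "sender_unknown": 0.3,        # attachment from outside the address book
    "permission_mismatch": 0.5,   # app permissions don't match its description
    "contacts_c2_server": 0.6,    # traffic toward a known command server
    "mass_file_encryption": 0.8,  # ransomware-like file activity
}

def risk_score(signals: set) -> float:
    """Combine independent risk signals: 1 - product of (1 - weight).
    Each extra signal raises the score without ever exceeding 1.0."""
    score = 1.0
    for s in signals:
        score *= 1.0 - CONTEXT_WEIGHTS.get(s, 0.0)
    return 1.0 - score
```

The point of this formulation is that several weak signals, each benign on its own, can combine into a score high enough to warrant attention.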
Thanks to their ability to handle complex contexts, artificial intelligence (AI) algorithms considerably improve diagnostic accuracy, while keeping the number of false positives under control.
Moreover, since AI models can be continuously retrained on new data, the processing of this mass of multi-modal input (text, images, code, actions, etc.) can be automated. This adaptability keeps them effective in the face of constantly evolving cybercriminal strategies.
Abstraction
Virtually every branch of artificial intelligence is now deployed in cybersecurity: classification and clustering algorithms, supervised learning (recognition of characteristic patterns), unsupervised learning (anomaly detection), reinforcement learning (continuous improvement), and so on.
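To make the unsupervised case concrete, here is a minimal anomaly detector: flag any observation more than a few standard deviations from the mean. The "bytes out per minute" framing is a hypothetical example; real systems use much more sophisticated statistical and learned models.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean,
    e.g. a sudden spike in outbound bytes per minute (possible exfiltration)."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # no variation, nothing to flag
    return [v for v in values if abs(v - mean) / stdev > threshold]
```

On a steady baseline with one large spike, only the spike is flagged; the appeal of this family of techniques is that no labeled malware samples are needed at all.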
In recent years, however, it is above all advances in the field of deep learning (complex neural networks) that have opened the door to new detection capabilities.
Indeed, these algorithms can build a representation that abstracts away details and extracts the key concepts from their input (find out more in our white paper). For example, it is possible to detect ransom notes in images, identify whether a website is a phishing site, find vulnerabilities in source code, or determine whether code is malicious in nature (e.g. PowerSheLLM, a malicious PowerShell code detection tool developed by GLIMPS).
However, although these AI algorithms offer performance so far unrivalled, they also have their limits. Beyond the resources required to train and/or deploy them, which can be substantial, the main limitation lies in acquiring and managing the data used to train these models.
On the one hand, we must remain constantly alert to spurious correlations in training data, whether intentional (dataset poisoning) or unintentional (a “natural” imbalance linked to data sources, for example).
On the other hand, many questions surround the labeling of samples (what should or should not be classified as malicious). In the case of PowerSheLLM, most false positives are legitimate administration scripts that request rights and perform actions identical to those used in cyber attacks.
Today’s challenges
For GLIMPS, in addition to the issues surrounding the acquisition and management of training data, one of the main current challenges lies in adding explainability (XAI, eXplainable AI) to the output of our artificial intelligence models. Modern models, especially those based on deep learning, are often perceived as “black boxes”. This opacity can be problematic in a cybersecurity context, where it’s crucial to understand why a file or action is identified as malicious. Without this transparency, it becomes difficult for trusted experts to validate, interpret or correct AI decisions. The explainability of AI models will be the subject of a future article.
In the meantime, keep in mind that cybercriminals are using similar technologies to generate malware, design phishing campaigns and set up increasingly sophisticated scams. So be vigilant about the data you share online, and adopt good practices such as using a password manager, enabling multi-factor authentication, and checking sources before clicking on a link or downloading a file. Tomorrow’s cyberthreats will exploit all available context to better target their victims.
Find out how AI can boost your detection and response strategies: request a demo!