High-accuracy AI for malware classification

Multicolored thumbnails of the pages of the paper

On Tuesday the paper "Computer activity learning from system call time series" that Curt and I wrote was posted to the arXiv. It explains how we used machine learning to create a minute-by-minute description of what is happening on a computer. While I take one sip of coffee, my MacBook will have run hundreds of programs, all doing things I can only discover through great effort. With the methods described in the paper, we can use the computer to distill the billions of things the CPU did during those seconds down to a human scale.

Built from system calls, the descriptions turned out to be very useful in practice. In the paper we used the malware detection problem, which is considered an open problem in cybersecurity, as a test. Hackers gain access to computers by a variety of methods, but to benefit from that access they need to find useful information to take from the machine. This can happen in minutes if what they want is easy to locate, such as an email archive or browser cookies, but it can take days if they have to look around or wait for that bank login. Attackers adopt different strategies for maintaining access and exfiltrating data, but those typically involve a piece of specialized software (the malware). The problem is to detect the malware as it executes.

The core of the method is a better distance function between executing programs. We provide several demonstrations that it works well. One is to compute the distance between different versions of Firefox and Chrome. In the paper there is an array of the similarities between different releases of Firefox. You can see how the similarity drops as the versions get further apart in time, and how Firefox and Chrome are more similar to each other than they are to most programs. All this works across versions and without training on either program.
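To give a feel for what a distance between executing programs can look like, here is a minimal sketch. It is not the distance function from the paper. It is a generic stand-in that assumes each program is represented by a list of system call names and compares the counts of consecutive call pairs using cosine similarity. The function names and the three traces are made up for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_counts(trace, n=2):
    """Count overlapping n-grams of system call names in a trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty trace is treated as similar to nothing
    return dot / (norm_a * norm_b)


def trace_distance(t1, t2, n=2):
    """Distance in [0, 1]: 0 for identical n-gram profiles, 1 for disjoint ones."""
    return 1.0 - cosine_similarity(ngram_counts(t1, n), ngram_counts(t2, n))


# Hypothetical traces, invented for this example (not real recordings).
browser_v1 = ["open", "read", "mmap", "socket", "connect", "send", "recv", "close"]
browser_v2 = ["open", "read", "mmap", "socket", "connect", "send", "recv", "recv", "close"]
archiver = ["open", "read", "write", "read", "write", "close"]

if __name__ == "__main__":
    print("v1 vs v2:      ", round(trace_distance(browser_v1, browser_v2), 3))
    print("v1 vs archiver:", round(trace_distance(browser_v1, archiver), 3))
```

In this toy setup the two browser-like traces end up much closer to each other than either is to the archiver-like trace, which is the qualitative behavior one wants from any such distance. Real traces are vastly longer and noisier, which is where a better distance function earns its keep.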

The main performance result is in Table 3, where we compare our results with those of other AIs (in case you are curious, F1 = 0.995 versus F1 = 0.857 at a false positive rate of 0.1%). The distance function, together with the time series of activities from the computer, makes the malware detection task seem easy. It just took us 3 years to make it look that way.
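For readers who do not work with these metrics every day, the sketch below shows how an F1 score and a false positive rate are computed from the counts of a binary classifier's confusion matrix. The counts in the example are invented and are not taken from the paper; the function names are hypothetical too.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp, tn):
    """Fraction of benign samples that were wrongly flagged as malware."""
    if fp + tn == 0:
        return 0.0
    return fp / (fp + tn)


# Made-up counts for illustration only (not results from the paper).
tp, fp, fn, tn = 90, 10, 30, 9990

if __name__ == "__main__":
    print("F1: ", round(f1_score(tp, fp, fn), 3))
    print("FPR:", false_positive_rate(fp, tn))
```

Quoting F1 at a fixed false positive rate matters because a detector can always trade one for the other by moving its decision threshold. Pinning the false positive rate at 0.1% keeps the comparison between methods on equal footing.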