Detecting PDF JS Obfuscation using Elementary Statistics

CVE-2013-0641 was recently found in the wild, and I was fortunate enough to get a sample. I have to apologize beforehand, but I won't be able to share it. If I come across a public sample and the exploit has been patched, I will share it. Someone has added an incomplete section of the code to pastebin. To be honest, I have done very little analysis of the sample. The farthest I have gotten is staring at the structure of the JS (Javascript). I haven't even deobfuscated it. Why? I think it's kind of elegant compared to the typical obfuscation we see in BlackHole exploit kits or other exploit packs. A typical side effect of obfuscating Javascript is that the structure of the code is destroyed. The code looks like one massive one-liner or blocks of code rather than a structured layout. From an interpreter's standpoint the Javascript can be laid out in any format; the code just needs correct syntax. Below we have obfuscated Javascript from a PDF. The first stage is a massive one-liner and the second stage is cleaned up using jsbeautifier.

Stage 1: Non-Tabbed, Stage 2: Tabs Added
In the second stage the structure is more visually appealing, but there are some flags that this wasn't written by a human. The first one is the line with the variable 'a', which is 10,579 chars long. Javascript has a flow when it's written by a human. Take, for example, the function below.

function sHOGG(c,d,e){
    var idx = d % c.length;
    var s = "";
    while (s.length < c.length){
        s += c[idx];
        idx = (idx + e) % c.length;
    }
    return s;
}

Code has a visual flow to it. If we look at the black as negative space we can see the flow: Tab, Tab, Tab, Tab>Tab, Tab>Tab, Tab, etc. Anyone who has programmed in Python understands this flow.
Most programmers use this structure because it's easier to read. Sometimes code will have newline chars stripped to save space, but it can be cleaned up using jsbeautifier, which restores something close to the original layout. Even when structurally cleaned up, most obfuscation destroys the flow. How does it destroy it? Well, let's graph the code and find out. Note all Python code can be found at the end of the post.
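The data behind a graph like this is just one number per line: its length. Here is a minimal sketch of collecting it, using a function like the one above as sample input. The helper name `line_lengths` is made up for illustration and is not part of the script at the end of the post. It also drops the trailing newline from each line, which the script does not.

```python
# Hypothetical helper for illustration, not from the original script.
def line_lengths(source):
    """Return the length of each line in a block of source code."""
    return [len(line) for line in source.splitlines()]

sample = """function sHOGG(c,d,e){
    var idx = d % c.length;
    var s = "";
    while (s.length < c.length){
        s += c[idx];
        idx = (idx + e) % c.length;
    }
    return s;
}"""

lengths = line_lengths(sample)

# One value per line; plotted in order, these give the "flow" of the code.
print(lengths)
```

Plotting `lengths` against the line index gives the kind of graph described in this post.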

Okay, time for the disclaimer. I wrote all the code and came up with the concept on three hours of sleep after a late night. I almost didn't post it, but it made me start thinking about lexical analysis, graph theory and all the cool stuff people smarter than myself are doing. Hopefully it does the same for others.

These images are simple plot graphs with the length of the line as the y axis and the line count as the x axis. The non-malicious Javascripts were grabbed from my local machine. The first plot, of jquery-cycle.lite, is a good example of the typical pattern/flow that JS shows when plotted. MicrosoftAjax[1].js is an 8k+ line Javascript file that shows a typical dense pattern. Below are obfuscated JS extracted from malicious PDFs.
Second PDF JS
In the Second PDF JS image we can see the line that contains the 10,579 chars. The rest of the data is very flat, with very little variation in length. If we were to compare the mean to the median, we would see the mean is vastly skewed by the single line that contains most of the obfuscated data.
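To see how much a single long line pulls the mean away from the median, here is a small sketch with made-up data. Only the 10,579 figure comes from the sample; the 42 short lines of 12 chars each are an assumption chosen purely for illustration.

```python
from statistics import mean, median

# Made-up line lengths: 42 short lines plus one 10,579-char line.
lengths = [12] * 42 + [10579]

# The median ignores the one huge line; the mean does not.
print("mean  ", mean(lengths))
print("median", median(lengths))
```

The median stays at the length of a typical short line, while the mean lands far above anything a typical line in the file looks like.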
JS from CVE-2013-0641
CVE-2013-0641 is very interesting because of the different clusters in the data. There are three patterns. The first runs from line 0 to just before 6,000. After that we have a block where the lines are consistently in the range of 550 to 700 chars. Then at the end we have what looks consistent with non-obfuscated JS. Examples of the code structure for the three patterns can be seen below.

Structure Example Code 0 through 5,800

Structure Example Code 5,800 through 6,400

Structure Example Code 6,400 to EOF
At this point I started to wonder about different techniques for detecting consistent plotted data blocks, clusters and other ways of telling obfuscated JS from non-obfuscated JS. If anyone has any comments or ideas please shoot me an email or leave a comment. The next question is how can we use this to detect a malicious JS in a PDF such as CVE-2013-0641? Might as well keep it simple and use elementary statistics. If we were to compare the mean and the median, we would notice the mean is skewed due to the length of the lines of obfuscated code.
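On the block-detection question, one very naive starting point is to look for long runs of consecutive lines whose length stays inside a band, like the 550 to 700 char block in CVE-2013-0641. The sketch below is only an illustration and is not part of the script at the end of the post. The name `find_runs`, the band and the minimum run length are all made up, and the data is synthetic.

```python
# Hypothetical helper for illustration, not from the original script.
def find_runs(lengths, low, high, min_run):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_run consecutive lines whose length is within [low, high]."""
    runs = []
    start = None
    for i, n in enumerate(lengths):
        if low <= n <= high:
            if start is None:
                start = i          # a new run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None           # the run (if any) is broken
    # Close a run that reaches the end of the data.
    if start is not None and len(lengths) - start >= min_run:
        runs.append((start, len(lengths)))
    return runs

# Synthetic data: 30 short lines, 100 long uniform lines, 15 short lines.
lengths = [20, 35, 18] * 10 + [600, 650, 580, 700] * 25 + [40, 12, 55] * 5
runs = find_runs(lengths, 550, 700, 50)
print(runs)
```

A fixed band only works once you already know what you are looking for, so treat this as a way to confirm a block seen in a plot rather than as a detector.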

Here are some examples pulled from PDFs with obfuscated Javascript:
Mean 621.395348837 - Median 10.0
Mean 341.941176471 - Median 18.5
Mean 92.0138528139 - Median 42.0 (CVE-2013-0641)
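Dividing each mean by its median makes the gap easier to compare across samples. A quick sketch of that arithmetic, using only the numbers listed above:

```python
# (mean, median) pairs copied from the list above.
samples = [
    (621.395348837, 10.0),
    (341.941176471, 18.5),
    (92.0138528139, 42.0),   # CVE-2013-0641
]

# Ratio of mean to median for each sample.
ratios = [m / med for m, med in samples]

for (m, med), ratio in zip(samples, ratios):
    print("mean %.2f / median %.1f = %.2f" % (m, med, ratio))
```

The CVE-2013-0641 sample has by far the smallest ratio of the three, at roughly 2.19.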

A mean divided by median greater than two is a decent threshold for detecting suspicious code. Time for the play at home version. The following is an example of the commands for the script below. We will need scipy, matplotlib, numpy and jsbeautifier.

>>> n = GraphMe() # Create instance
>>> n.process(open('pdf1.out', 'r')) # open JS file
>>> n.plot() # plot it

>>> n.outlier() # check if the JS is suspicious
Suspicious: mean 642.476190476 median 11.5

I might create a repo for it. Couldn't think of a name. Which seems to be the hardest part of creating a repo.

## created by alexander<dot>hanel<at>gmail<dot>com
## 2/21/2013
## No license, free game to use, just give credit or you suck. 
import sys
from StringIO import StringIO
import pylab as pylab
import matplotlib.pyplot as plt
import numpy as num 
import jsbeautifier

class GraphMe():
    def __init__(self):
        self.fullData = ''
        self.bjs = False
        self.PS = True
        self.plotData = []
        self.x = []
        self.y = []

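    def beautifier(self, buffer):
        'clean up the JS'
        try:
            # accept either a string or an open file object
            if hasattr(buffer, 'read'):
                buffer = buffer.read()
            temp = jsbeautifier.beautify(buffer)
        except Exception:
            print "ERROR: jsbeautifier"
            print "EXITING...."
            sys.exit(1)
        return temp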
    def process(self,data):
        if self.bjs == True:
            data = self.beautifier(data)
        if type(data) is str:
            data = StringIO(data)
        self.fullData = data.readlines()
        # clean up JS that is all one line 
        if len(self.fullData) == 1 or self.PS == True:
            self.PS = False
            self.bjs = True
        for t in range(len(self.fullData)): self.x.append(t)
        for t in self.fullData : self.y.append(len(t))

    def outlier(self):
        'calculate if mean/median > 2'
        if num.mean(self.y)/num.median(self.y) > 2:
            print "Suspicious: mean %s median %s" % (num.mean(self.y), num.median(self.y))

    def graph(self):
        'create graph of the JS'
        fig = pylab.figure()
        ax = fig.add_subplot(1,1,1)
        ax.plot(self.x, self.y)
        pylab.show()

    def plot(self):
        'create plot of the JS'
        plt.plot(self.y, 'ro-')


  1. Statistical(ly)Suspects that can be like StaSus

    1. I like it. Now that I have a name I just need to create the repo. Thanks.

  2. there is way too much legitimate obfuscated code out there... you have to account for the false positives.

    1. I completely agree except for instances of obfuscated code in PDFs.