Written By Jesse Sampson And Presented By Charles Leaver CEO Ziften
In the first post on edit distance, we looked at hunting for malicious executables with edit distance (i.e., how many character edits it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how we can build edit distance features that can be combined with other domain name features to pinpoint suspicious activity.
Case Study Background
What are bad actors up to with malicious domains? It may be as simple as using a close misspelling of a common domain name to trick careless users into viewing ads or picking up adware. Legitimate sites are slowly catching on to this technique, sometimes called typo-squatting.
Other malicious domains are the product of domain generation algorithms, which can be used for all kinds of nefarious purposes, such as evading countermeasures that block known compromised sites or overwhelming domain name servers in a distributed denial-of-service (DDoS) attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, which further confuses defenders.
Edit distance can help with both use cases; here is how. First, we exclude common domain names, since these are generally safe. A list of common domain names also provides a baseline for spotting anomalies. One good source is Quantcast. For this discussion, we will stick to domains and avoid subdomains (e.g. ziften.com, not www.ziften.com).
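The cleaning step described above can be sketched in Python. This is an illustrative sketch, not the Vertica SQL used for the case study: the function names are hypothetical, and reducing a hostname to its last two labels is an assumption that only holds for single-label top-level domains (suffixes such as .co.uk would need a public suffix list).

```python
def registered_domain(hostname: str) -> str:
    """Naively reduce a hostname to its last two labels.

    Assumption: single-label TLDs only (www.ziften.com -> ziften.com).
    Multi-label suffixes such as .co.uk are not handled here.
    """
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def candidate_domains(observed, popular):
    """Return observed domains that are not on the popular list.

    Subdomains are collapsed, duplicates are dropped, and the order of
    first appearance is preserved.
    """
    popular_set = {registered_domain(d) for d in popular}
    seen = set()
    candidates = []
    for host in observed:
        domain = registered_domain(host)
        if domain in popular_set or domain in seen:
            continue
        seen.add(domain)
        candidates.append(domain)
    return candidates
```

The popular list does double duty: it filters out domains that are presumed safe, and it later serves as the set of neighbors that each remaining candidate is compared against.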
After data cleaning, we compare each candidate domain (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name: classically .com, .org, and so on, though today it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domain names that are one edit away from their nearest neighbor, we can easily spot typo-ed domain names. By finding domain names far from their nearest neighbor (the normalized edit distance we introduced in Part 1 is useful here), we can also find anomalous domains in the edit distance space.
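The nearest-neighbor search can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original Vertica query: the function names are hypothetical, the normalization (distance divided by the length of the longer name) is an assumption because Part 1's exact definition is not repeated here, and the 0.8 cutoff for "far" is an arbitrary placeholder.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def split_tld(domain: str):
    """Split 'name.tld' into ('name', 'tld')."""
    name, _, tld = domain.rpartition(".")
    return name, tld


def nearest_neighbor(candidate, popular):
    """Find the closest popular domain in the same TLD.

    `popular` is assumed to be ordered by popularity, most popular first.
    Returns (neighbor, rank, distance, normalized_distance), or None when
    no popular domain shares the candidate's TLD. Ties keep the
    higher-ranked neighbor.
    """
    name, tld = split_tld(candidate)
    best = None
    for rank, domain in enumerate(popular, start=1):
        other, other_tld = split_tld(domain)
        if other_tld != tld:
            continue
        distance = edit_distance(name, other)
        if best is None or distance < best[2]:
            # Assumed normalization: divide by the longer name's length.
            normalized = distance / max(len(name), len(other), 1)
            best = (domain, rank, distance, normalized)
    return best


def label(candidate, popular, far_threshold=0.8):
    """Rough triage; far_threshold is a hypothetical cutoff."""
    hit = nearest_neighbor(candidate, popular)
    if hit is None:
        return "no-neighbor"
    _, _, distance, normalized = hit
    if distance == 1:
        return "possible-typo"
    if normalized >= far_threshold:
        return "far-from-neighbors"
    return "unremarkable"
```

This brute-force loop compares every candidate with every popular domain in the same top-level domain, which is fine for experimentation but would need indexing or blocking to scale to large lists.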
What Were the Results?
Let’s look at how these results appear in practice. Be careful when browsing to these domain names, since they could contain malicious content!
Here are a few potential typos. Typo-squatters target popular domains because there is a better chance that someone will visit them. Several of these are suspicious according to our threat feed partners, but there are some false positives too, with cute names like “wikipedal”.
Here are some odd-looking domain names far from their neighbors.
So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine-learning model: rank of the nearest neighbor, distance from that neighbor, and a flag for edit distance 1 from the neighbor, indicating a risk of typo shenanigans. Other features that could play well with these include lexical features such as word and n-gram distributions, entropy, and string length, as well as network features like the number of failed DNS requests.
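A few of the lexical features mentioned above can be computed with a short Python sketch. The function and key names are hypothetical, and character bigrams stand in for the broader "n-gram distributions" idea; this is an illustration, not the feature set used in the case study.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def lexical_features(domain: str) -> dict:
    """Features of the name part of a domain (TLD stripped)."""
    name = domain.rpartition(".")[0] or domain
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "distinct_bigrams": len(char_ngrams(name, 2)),
    }
```

Randomly generated names tend to score higher on character entropy than dictionary-like names of similar length, which is why entropy is a natural companion to the edit distance features.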
Simplified Code You Can Experiment With
Here is a simplified version of the code to play with! It was developed on HP Vertica, but this SQL should run on most modern databases. Note that the Vertica editDistance function may have a different name in other implementations (e.g. levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).