Today’s cyber attacks have evolved from direct attacks on physical systems to more sophisticated Internet-based attacks, which implies that any web service or user can be a target. Specifically, attackers compromise a legitimate website to spread malicious code to its visitors; this type of attack is called a drive-by download attack [1–4]. Recent studies [5–8] have shown that the drive-by download is the most prevalent type of attack among all Internet-based threats.
Identifying distribution sites and blocking them in advance is one of the best ways to prevent a drive-by download attack. The other way is to remove the vulnerabilities of a website and protect it from being compromised. First, to effectively detect a malware distribution network (MDN), various methods have been proposed, and they can be classified into two categories. One is based on static analysis [9–11], such as pattern matching and metadata analysis; the other is dynamic analysis [12–14], which includes script emulation and virtual machine-based validation. However, existing works have limitations when it comes to recent sophisticated evasion methods, such as obfuscation, hiding, and circumvention of the virtual machine environment. Second, to detect the vulnerabilities of a website, various tools have been proposed [15,16]. These tools analyze the vulnerabilities of applications (or plug-ins), such as PDF and Flash, that are included in the website. However, the focus of our study is to more effectively identify an MDN that has already been created by an attacker. Therefore, in this study we take the approach of analyzing the features of the malicious website.
In this paper, we present an effective method for detecting recent advanced drive-by download attacks. The proposed method does not simply match the pattern of the contents, but analyzes the behavior of malware distribution. To analyze the connection of webpages used for malware distribution, it extracts the structure of the MDN using script emulation. We present four features of the MDN to detect a malicious connection of webpages. We generated these features from the analysis of a collected MDN dataset and also utilized features used in existing works.
We evaluated the detection performance of the proposed method by comparing it with VirusTotal, a URL scanning engine. The evaluation results show that our method outperforms the existing detection methods in terms of detection rate.
2. Related Work
2.1 Drive-by Download Attack
2.2 Detection Methods and Limitations
Various detection methods against drive-by download attacks have been proposed, and they are classified into two categories: static analysis and dynamic analysis. First, static analysis utilizes information from the source code of webpages or the metadata of websites.
Second, dynamic analysis methods utilize a virtual machine environment or web browser emulation. Virtual machine-based methods [12,13] visit websites with a vulnerable system setting (e.g., an unpatched version of Internet Explorer and Windows OS) and check whether any malicious changes have occurred in the system. On the other hand, emulation-based dynamic analysis methods emulate the web browser engine to analyze the behavior of a website that a user has visited.
Moreover, malicious or compromised websites send visitors different content depending on the HTTP request message. Fig. 1 shows a real-world example where a website only responds with a malicious link if the visitor is using Internet Explorer 8.
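The request-dependent behavior described above can be illustrated with a small sketch. It is not part of the proposed method: the helper names (`extract_links`, `cloaked_links`) and the sample responses are hypothetical, and the responses are given as strings rather than fetched over the network. The sketch compares the links returned for different User-Agent values and reports those that appear only for some of them.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the values of src/href attributes found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.links.add(value)


def extract_links(html):
    """Return the set of linked URLs in one HTML response."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def cloaked_links(responses):
    """Map each User-Agent to the links that are not served to every agent.

    `responses` is a dict {user_agent: html}; both are illustrative inputs.
    """
    per_agent = {ua: extract_links(html) for ua, html in responses.items()}
    common = set.intersection(*per_agent.values()) if per_agent else set()
    return {ua: links - common for ua, links in per_agent.items() if links - common}


# Hypothetical responses from the same URL under two User-Agent headers.
responses = {
    "MSIE 8.0": '<a href="/home">h</a><iframe src="http://hop.example/x.js"></iframe>',
    "Firefox": '<a href="/home">h</a>',
}
suspicious = cloaked_links(responses)
```

A link that is served to only one browser profile is not malicious by itself, but it marks a page that a single-profile crawler would analyze incompletely.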
3. ELPA: Emulation-Based Linked Page Map Analysis
3.1 Challenges for Detecting a Malware Distribution Network
In this paper, we focus on the third challenge. We introduce the concept of a Linked Page Map (LPM), which illustrates the automatic redirection in the MDN. We propose an Emulation-based Linked Page Map Analysis (ELPA), which analyzes an LPM with a set of features and decides whether it is malicious.
3.2 Emulation-Based LPM Analysis (ELPA)
3.2.1 Linked Page Map
We define a Linked Page Map (LPM) as a chain of webpages potentially regarded as an MDN. An LPM is represented as a directed graph, as shown in Fig. 2, and consists of nodes and directed links. Each node indicates a webpage, and a directed link exists between two pages if an automatic redirection can occur between them. There can be multiple paths via different landing and hopping pages to the same exploit page.
Each node is represented as a tuple with 6 attributes as follows:
where i,j ∈ N
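As a minimal sketch of the graph structure described above, the following code stores an LPM as a directed graph and enumerates the redirection paths between two pages. The class name `LinkedPageMap`, its methods, and the example URLs are assumptions made for illustration; nodes are identified only by URL here, and the per-node attributes are left out.

```python
from collections import defaultdict


class LinkedPageMap:
    """Directed graph of webpages; an edge means an automatic redirection can occur."""

    def __init__(self):
        self.nodes = set()
        self.links = defaultdict(set)

    def add_redirect(self, src, dst):
        """Record that page `src` can automatically redirect to page `dst`."""
        self.nodes.update((src, dst))
        self.links[src].add(dst)

    def paths(self, start, goal, _seen=()):
        """Yield every cycle-free redirection path from `start` to `goal`."""
        if start == goal:
            yield list(_seen) + [goal]
            return
        for nxt in sorted(self.links.get(start, ())):
            if nxt not in _seen and nxt != start:
                yield from self.paths(nxt, goal, _seen + (start,))


# Hypothetical LPM: two landing pages reach one exploit page via hopping pages.
lpm = LinkedPageMap()
lpm.add_redirect("http://landing-a.example/", "http://hop-1.example/r")
lpm.add_redirect("http://landing-a.example/", "http://hop-2.example/r")
lpm.add_redirect("http://landing-b.example/", "http://hop-2.example/r")
lpm.add_redirect("http://hop-1.example/r", "http://exploit.example/e")
lpm.add_redirect("http://hop-2.example/r", "http://exploit.example/e")

routes = list(lpm.paths("http://landing-a.example/", "http://exploit.example/e"))
```

Here `routes` holds two paths, one through each hopping page, which mirrors the observation that several paths can lead to the same exploit page.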
3.2.2 Proposed features
We statistically analyzed 7,390 LPMs (3,437 MDNs and 3,953 non-MDNs) that were previously collected and selected existing features that are less affected by attackers and are closely related to an indication of malware distribution. Also, we defined four new features, which will be described in detail in the next subsection. Table 3 shows the LPM features used by ELPA.
Feature 1. The number of nodes in an LPM
The first feature is the number of nodes in an LPM. An attacker builds an MDN by compromising a regular website and linking it to additional pages. Therefore, an MDN tends to have more nodes than a non-MDN. Fig. 3 compares the number of nodes in the LPMs of MDNs and non-MDNs. As illustrated in the figure, most LPMs of non-MDNs have only one or two nodes, whereas the LPMs of MDNs have multiple nodes. Thus, this number can be used as a feature of an LPM. This feature can be represented as Eq. (2):
Feature 2. Number of domains in an LPM
As previously mentioned, an MDN usually consists of various types of webpages and websites. In addition, the sites that spread the malicious code usually have domains different from that of the website in operation, because they are created by the attacker. Fig. 4 illustrates the number of different domains used in MDNs and non-MDNs. We can see that MDNs contain multiple domains, whereas most non-MDNs consist of one or two domains. Based on this result, we use the number of domains in an LPM as a feature. Feature 2 can be represented as Eq. (3):
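Features 1 and 2 are simple counts over the pages of an LPM, which the following sketch illustrates. The function names and example URLs are hypothetical, and the hostname of each URL is used as a stand-in for its domain, which is a simplifying assumption (two hostnames under the same registered domain would be counted separately).

```python
from urllib.parse import urlsplit


def count_nodes(urls):
    """Feature 1 (sketch): number of distinct pages in the LPM."""
    return len(set(urls))


def count_domains(urls):
    """Feature 2 (sketch): number of distinct hostnames among the LPM pages."""
    hosts = {urlsplit(url).hostname for url in urls}
    hosts.discard(None)  # URLs without a network location carry no hostname
    return len(hosts)


# Hypothetical LPM with four pages spread over three hosts.
lpm_urls = [
    "http://news.example/index.html",
    "http://news.example/ad.js",
    "http://hop.example/r.php?id=7",
    "http://exploit.example/e.html",
]
n_nodes = count_nodes(lpm_urls)
n_domains = count_domains(lpm_urls)
```

For the example list, `n_nodes` is 4 and `n_domains` is 3.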
Feature 3. Characteristics of a web link in an MDN
In a drive-by download attack, a victim is redirected through a path consisting of a landing site, multiple hopping sites, and an exploiting site. As illustrated in Fig. 5, some hopping sites are shared by multiple MDNs. One may think that the number of hopping sites and the linking method between the sites can be features that represent the MDN. However, since the number of sites and the linking method vary depending on the attackers who construct the MDNs, they are not appropriate features.
Feature 4. Obfuscation (Encryption)
In contrast to the existing features, the four proposed features rely not only on the contents of the webpage but also on the behavior and structure of the malware distribution process. Thus, the proposed features are more robust against changes in attack methods and strategies. In the following subsection, we introduce the detection algorithm of ELPA, which utilizes the proposed features.
3.3 Algorithm for ELPA
The MDN detection algorithm for ELPA is presented in Table 4. It checks whether the LPM satisfies all four features described in Section 3.2 and decides on its maliciousness. In Step 1, the algorithm initializes the LPM structure; in Step 2, it counts the number of nodes and domains (Features 1 and 2). In Step 3, it examines whether the webpage is obfuscated and whether any parameter values exist in the URL. Step 4 finalizes the maliciousness of the LPM based on the analysis results of the previous steps.
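The four steps can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of Table 4: the thresholds `MIN_NODES` and `MIN_DOMAINS`, the `Page` type, and the obfuscation heuristic are all hypothetical placeholders, and hostnames again stand in for domains.

```python
import re
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit

MIN_NODES = 3    # assumed threshold, not taken from the paper
MIN_DOMAINS = 2  # assumed threshold, not taken from the paper


@dataclass
class Page:
    url: str
    script: str = ""  # script content of the page, if any


def looks_obfuscated(script):
    """Placeholder heuristic: decoding calls or a long run of escaped bytes."""
    if re.search(r"\b(eval|unescape|fromCharCode)\s*\(", script):
        return True
    return bool(re.search(r"(?:%[0-9a-fA-F]{2}|\\x[0-9a-fA-F]{2}){16,}", script))


def has_url_parameters(url):
    """True if the URL query string carries at least one non-empty value."""
    return bool(parse_qsl(urlsplit(url).query))


def is_mdn(pages):
    # Step 1: initialize the LPM structure (here, just the list of page URLs).
    urls = [page.url for page in pages]
    # Step 2: count nodes and domains (Features 1 and 2).
    n_nodes = len(set(urls))
    n_domains = len({urlsplit(u).hostname for u in urls} - {None})
    # Step 3: check for obfuscation and for parameter values in the URLs.
    obfuscated = any(looks_obfuscated(page.script) for page in pages)
    parameterized = any(has_url_parameters(u) for u in urls)
    # Step 4: the LPM is flagged only if all four checks are satisfied.
    return (n_nodes >= MIN_NODES and n_domains >= MIN_DOMAINS
            and obfuscated and parameterized)


suspicious_lpm = [
    Page("http://news.example/index.html"),
    Page("http://hop.example/r.php?id=7", "eval(unescape('%75%6e'))"),
    Page("http://exploit.example/e.html"),
]
benign_lpm = [Page("http://shop.example/")]
```

Requiring all four checks follows the description above; a deployment would need thresholds and an obfuscation test derived from measured data rather than the values assumed here.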
4.1 Evaluation Environment
We evaluated the performance of ELPA by comparing it with the URL scanning engine of VirusTotal. Fig. 7 illustrates the evaluation environment. During the two-week evaluation period, the test bed periodically downloaded 508 live websites, and ELPA extracted an LPM from each website using the web browsing emulator and analyzed it. Then, ELPA decided whether the LPM was an MDN by examining it with the features described in Section 3. In addition, we used an open API provided by VirusTotal to submit the URL information to its URL scanning engine and determine the maliciousness of the webpage.
4.2 Website URL Dataset
We selected 508 live websites that had been compromised at least once by an attacker for a drive-by download attack within a year and used them as the seed for our dataset. Since the status of the websites can change during the year, we examined them using a two-phase evaluation. In the first phase, we submitted the URLs of the websites to VirusTotal and checked the decisions of its detection engines. If all of the engines decided that a website was clean, we categorized it as a benign site. Otherwise, in the second phase, we manually examined its maliciousness. Table 5 summarizes how we categorized the websites in our seed dataset.
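The categorization rule can be written down as a short sketch. The function name, the label strings, and the engine names are hypothetical, and the input is assumed to be a mapping from each detection engine to whether it flagged the URL.

```python
def categorize(engine_verdicts):
    """Label a website from per-engine verdicts.

    `engine_verdicts` maps an engine name to True if that engine flagged the
    URL as malicious. A site is labeled benign only if every engine reported
    it as clean; anything else is passed on to manual examination.
    """
    if engine_verdicts and not any(engine_verdicts.values()):
        return "benign"
    return "manual_review"


# Hypothetical verdicts for two websites.
site_a = categorize({"engine-1": False, "engine-2": False})
site_b = categorize({"engine-1": False, "engine-2": True})
```

In this sketch a site with no verdicts at all is also sent to manual review, on the assumption that the absence of a decision should not count as a clean result.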
4.3 Evaluation Results
During the two-week evaluation, ELPA periodically visited and checked all 508 websites 42 times. At the beginning of the evaluation, we found that three websites were still compromised and used as a landing site of an MDN. Fig. 8 illustrates the number of MDNs detected during the evaluation period. Over the entire period, ELPA found 346 MDNs, and some webpages were detected in consecutive rounds (represented as dotted circles in the figure). This is because, if a compromised website is not fixed between two consecutive rounds, ELPA detects the MDN rooted at that website in both rounds.
The 346 detected MDNs comprised 17 unique MDNs, together with the malicious (or compromised) webpages that constituted them. We submitted the URLs of these webpages to VirusTotal. Table 6 shows the comparison results of malicious webpage detection. As shown in the table, at most 75% of the malicious webpages detected by ELPA could be detected by the existing solutions supported by VirusTotal. This means that ELPA is more robust against advanced evasion methods than the other existing solutions listed in the table.
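Because the same compromised website can be reported in several rounds, the total number of detections differs from the number of distinct sites. The following sketch shows one way to separate the two; the function name, the `(round, site)` record format, and the sample data are assumptions for illustration and do not reproduce the evaluation data.

```python
from collections import defaultdict


def summarize(detections):
    """Summarize per-round detections given as (round_number, landing_site) pairs.

    Returns the total number of detections, the number of distinct landing
    sites, and the set of sites detected in at least two consecutive rounds.
    """
    rounds_by_site = defaultdict(set)
    for round_number, site in detections:
        rounds_by_site[site].add(round_number)
    total = sum(len(rounds) for rounds in rounds_by_site.values())
    repeated = {
        site
        for site, rounds in rounds_by_site.items()
        if any(r + 1 in rounds for r in rounds)
    }
    return total, len(rounds_by_site), repeated


# Hypothetical detections: site "a" stays compromised across rounds 1 and 2.
sample = [(1, "a"), (2, "a"), (2, "b"), (5, "b")]
total, unique_sites, consecutive = summarize(sample)
```

For the sample data this yields four detections over two distinct sites, with only site "a" detected in consecutive rounds.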
In our future work, we will improve ELPA to precisely extract LPMs from a chain of webpages that use more advanced evasion techniques. Moreover, we plan to resolve some limitations of the current ELPA method, such as detecting the case where a landing page is the only node in an MDN.
Sang-Yong Choi http://orcid.org/0000-0001-5152-3897
He received his B.S. degree in Mathematics and M.S. degree in Computer Science, both from Hannam University in 2000 and 2003, and Ph.D. degree in Interdisciplinary of Information Security from Chonnam National University in 2014, Korea. He is a principal researcher at the Cyber Security Research Center in Korea Advanced Institute of Science and Technology (KAIST). His research interests are in web security, network security and privacy.
Daehyeok Kim http://orcid.org/0000-0002-7439-1783
He received his B.S. degree in Computer Science and Engineering and M.S. degree in IT Convergence Engineering, both from Pohang University of Science and Technology (POSTECH), Korea. He is a senior researcher at the School of Computing in Korea Advanced Institute of Science and Technology (KAIST), Korea. His research interests are in distributed systems, networking, and their security and privacy.
Yong-Min Kim http://orcid.org/0000-0002-5066-3908
He received his Ph.D. from the Dept. of Computer Science and Statistics at Chonnam National University, Korea. He is an associate professor at the Dept. of Electronic Commerce, Chonnam National University, Yeosu, Korea. His research interests are in security and privacy, system and network security, and application security such as electronic commerce.
1. N. Provos, D. McNamee, P. Mavrommatis, K. Wang, and N. Modadugu, "The ghost in the browser: analysis of web-based malware," in Proceedings of the 1st Conference on First Workshop on Hot Topics in Understanding Botnets, Cambridge, MA, 2007, pp. 1-9.
2. European Union Agency for Network and Information Security, ENISA Threat Landscape 2012; Jan, 2013, https://www.enisa.europa.eu/activities/risk-management/evolving-threat-environment/enisa-threat-landscape/ENISA_Threat_Landscape.
3. FY. Rashid, Department of labor website hacked to distribute malware; May, 2013, http://www.securityweek.com/department-labor-website-hacked-distribute-malware.
4. J. Pepitone, NBC hack infects visitors in ‘drive by’ cyberattack; Feb, 2013, http://money.cnn.com/2013/02/22/technology/security/nbc-com-hacked-malware/.
5. E. Protalinski, A first: Hacked sites with Android drive-by download malware; May, 2012, http://www.zdnet.com/article/a-first-hacked-sites-with-android-drive-by-download-malware/.
6. HAURI, Malware analysis report of NateOn hacking; Aug, 2012, http://eyesray.tistory.com/attachment/cfile10.uf@1615E0564E3DDD17213F95.pdf.
7. G. Cluley, DarkSeoul: SophosLabs identifies malware used in South Korean internet attack; Mar, 2013, http://nakedsecurity.sophos.com/2013/03/20/south-korea-cyber-attack.
8. ASEC, Malware analysis report using 6.25 DDoS attack; Jun, 2013, http://asec.ahnlab.com/949.
9. KZ. Chen, G. Gu, J. Zhuge, J. Nazario, and X. Han, "WebPatrol: automated collection and replay of web-based malware scenarios," in Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, Hong Kong, 2011, pp. 186-195.
10. J. Ma, LK. Saul, S. Savage, and GM. Voelker, "Identifying suspicious URLs: an application of large-scale online learning," in Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009, pp. 681-688.
11. J. Ma, LK. Saul, S. Savage, and GM. Voelker, "Beyond blacklists: learning to detect malicious web sites from suspicious URLs," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 1245-1254.
12. N. Provos, P. Mavrommatis, MA. Rajab, and F. Monrose, "All your iFRAMEs point to us," in Proceedings of 17th USENIX Security Symposium, San Jose, CA, 2008, pp. 1-16.
13. A. Moshchuk, T. Bragin, D. Deville, SD. Gribble, and HM. Levy, "SpyProxy: execution-based detection of malicious web content," in Proceedings of 16th USENIX Security Symposium, Boston, MA, 2007, pp. 1-16.
15. B. Genge, and C. Enachescu, "ShoVAT: Shodan-based vulnerability assessment tool for Internet-facing services," Security and Communication Networks; 2015, http://dx.doi.org/10.1002/sec.1262.
16. B. Genge, and C. Enachescu, "Non-intrusive historical assessment of internet-facing services in the internet of things," MACRo, vol. 1, no. 1, pp. 25-36, 2015.
17. VirusTotal, [Online]; Available: https://www.virustotal.com/.
19. SY. Choi, IS. Kang, DH. Kim, BN. Noh, and YM. Kim, "Multi-level emulation for malware distribution networks analysis," Journal of the Korea Institute of Information Security & Cryptology, vol. 23, no. 6, pp. 1121-1129, 2013.
20. SpiderMonkey [Online]; Available: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey.