Mining Structural and Behavioral Patterns in Smart Malware

PhD Thesis with distinction

"Nothing in life is to be feared, it is only to be understood."

— Marie Curie —

Malware is definitely not an exception.

Author: Guillermo Suarez-Tangil
Supervisors: Juan E. Tapiador,
Pedro Peris-Lopez
Institution: Dept. of Computer Science
at UC3M
Research Visits: Roberto Di Pietro,
Mauro Conti

Abstract

Smart devices equipped with powerful sensing, computing and networking capabilities have proliferated lately, ranging from popular smartphones and tablets to Internet appliances, smart TVs, and others that will soon appear (e.g., watches, glasses, and clothes). One key feature of such devices is their ability to incorporate third-party apps from a variety of markets. This poses strong security and privacy issues to users and infrastructure operators, particularly through software of malicious (or dubious) nature that can easily get access to the services provided by the device and collect sensory data and personal information.

Malware in current smart devices (mostly smartphones and tablets) has rocketed in the last few years, supported by sophisticated techniques (e.g., advanced obfuscation and targeted infection and activation engines) purposely designed to overcome security architectures currently in use by such devices. This phenomenon is known as the proliferation of smart malware. Even though important advances have been made on malware analysis and detection in traditional personal computers during the last decades, adopting and adapting those techniques to smart devices is a challenging problem. For example, power consumption is one major constraint that makes it unaffordable to run traditional detection engines on the device, while externalized (i.e., cloud-based) techniques raise many privacy concerns.

This Thesis examines the problem of smart malware in such devices, aiming at designing and developing new approaches to assist security analysts and final users in the analysis of the security nature of apps. We first present a comprehensive analysis of how malware has evolved over the last years, as well as recent progress made to analyze and detect malware. Additionally, we compile the most cutting-edge open source tools, and we design a versatile and multipurpose research laboratory for smart malware analysis and detection.

Second, we propose a number of methods and techniques aiming at better analyzing smart malware in scenarios with a constant and large stream of apps that require security inspection. More precisely, we introduce Dendroid, an effective system based on text mining and information retrieval techniques. Dendroid uses static analysis to measure the similarity between malware samples, which is then used to automatically classify them into families with remarkable accuracy. Then, we present Alterdroid, a novel dynamic analysis technique for automatically detecting hidden or obfuscated malware functionality. Alterdroid introduces the notion of differential fault analysis for effectively mining obfuscated malware components distributed as parts of an app package.

Next, we present an evaluation of the power-consumption trade-offs among different strategies for off-loading, or not, certain security tasks to the cloud. We develop a system for testing several functional tasks and metering their power consumption. Based on the results obtained in this analysis, we then propose a cloud-based system, called Targetdroid, that addresses the problem of detecting targeted malware by relying on stochastic models of usage and context events derived from real user traces. Based on these models, we build an efficient automatic testing system capable of triggering targeted malware.

Finally, based on the conclusions extracted from this Thesis, we propose a number of open research problems and future directions where there is room for research.

Motivation

This Thesis identifies two fundamental open issues where research is needed: there is more malware than ever before, and it is increasingly sophisticated.

P1: Sustained growth in the number of malicious apps targeting smart devices.

Malware has become a rather profitable business due to the existence of a large number of potential targets and the availability of reuse-oriented malware development methodologies that make it exceedingly easy to produce new samples. The impressive growth in both malware and benign apps is making any human-driven analysis of potentially dangerous apps increasingly unaffordable. This is especially critical as current trends in malware engineering suggest that malicious software will continue to grow both in number and sophistication. As a result, market operators and malware analysts are overwhelmed by the amount of newly discovered samples that must be analyzed. This is further complicated by the fact that determining which applications are malicious and which are not is still a formidable challenge, particularly for grayware.

This has motivated the need for automated analysis techniques and instruments to alleviate the workload of performing intelligent security analysis over software. For instance, when confronted with a continuously growing stream of incoming malware samples, it would be extremely helpful to differentiate between those that are minor variants of a known specimen and those that correspond to novel, previously unseen samples. Grouping samples into families, establishing the relationships among them, and studying the evolution of the various known “species” is also a much sought after application.

P2: Increase in the sophistication of malicious apps and the rise of a new generation of smart malware.

Malware for current smartphone platforms is becoming increasingly sophisticated and developers are progressively using advanced techniques to defeat malware detection tools. On one hand, smartphone malware is becoming more and more stealthy and recent specimens are relying on advanced code obfuscation techniques to evade detection. These techniques create an additional obstacle to malware analysts, who see their task further complicated and have to ultimately rely on carefully controlled dynamic analysis techniques to detect the presence of potentially dangerous pieces of code. On the other hand, the presence of advanced networking and sensing functions in the device is giving rise to a new generation of smarter malware. These malware instances are characterized by a more complex situational awareness, in which decisions are made on the basis of factors such as the location, the user profile, or the presence of other apps.

This state of affairs has consolidated the need for smart analysis techniques to aid malware analysts in their daily functions. This challenge has to be tackled by novel methods to efficiently support market operators and security analysts. In some cases, this problem cannot be solved by market operators alone or by enhanced security models, as it really depends on each user’s privacy preferences. For example, a leakage of data such as one’s location or the list of contacts might well constitute a serious privacy issue for many users, but others will simply not care about it.

The situation described above inevitably leads to the need for more sophisticated analysis techniques. This, however, poses an important challenge: many devices suffer from strong limitations in terms of power consumption, so certain security tasks executed on the platform may be simply unaffordable. External analysis performed on the cloud in near real time can constitute an alternative. Such a strategy seeks to save battery life by trading computation costs for communication costs, but it still remains unclear whether this is optimal or not in all circumstances. Furthermore, the rise of targeted (user-specific) malware poses one additional challenge: conducting particularized analysis for a specific user and execution context.
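
As a rough illustration of this trade-off, the following Python sketch compares an energy estimate for running a task on the device against an estimate for shipping its input to the cloud. The linear cost model, the function names, and every number are assumptions made for illustration; they are not measurements or models taken from this Thesis.

```python
# Minimal sketch of the on-device vs. offloaded energy trade-off.
# All parameters below are hypothetical placeholders, not measured values.

def on_device_energy(cpu_seconds, cpu_power_watts):
    """Energy (joules) spent computing locally: time multiplied by power draw."""
    return cpu_seconds * cpu_power_watts


def offloaded_energy(payload_bytes, joules_per_byte, radio_overhead_joules):
    """Energy (joules) spent sending the data out, plus a fixed radio wake-up cost."""
    return payload_bytes * joules_per_byte + radio_overhead_joules


def cheaper_option(cpu_seconds, cpu_power_watts,
                   payload_bytes, joules_per_byte, radio_overhead_joules):
    """Return 'local' or 'offload', whichever has the lower energy estimate."""
    local = on_device_energy(cpu_seconds, cpu_power_watts)
    remote = offloaded_energy(payload_bytes, joules_per_byte, radio_overhead_joules)
    return "local" if local <= remote else "offload"


if __name__ == "__main__":
    # Small payload: communication is cheap relative to computation.
    print(cheaper_option(4.0, 1.5, 200_000, 1e-5, 1.0))
    # Large payload: the same task becomes cheaper to run on the device.
    print(cheaper_option(4.0, 1.5, 2_000_000, 1e-5, 1.0))
```

Under these invented figures the preferred option flips as the payload grows, which is why the question cannot be settled without measuring each cost on real hardware.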

Objectives

The main goal of this Thesis is to study methods, tools and techniques to assist security analysts and final users in the analysis of untrusted apps for smart devices and automate the identification of smart malware.

To achieve this goal, we will focus on the following three general objectives:

Study the evolution and current state of malware for smart devices, as well as recent progress made to analyze and detect it.
Develop techniques aiming at better analyzing malware in large scale software markets, with particular emphasis on intelligent instruments to automate parts of the analysis process.
Facilitate the analysis of complex smart malware in scenarios with a constant and large stream of apps on target. Examples of such sophistication include malware targeting user-specific actions, malware hindering detection with advanced obfuscation techniques, or malware exploiting the battery limitations of current devices, to name a few.

Contributions

This Thesis provides several contributions in the field of smart malware detection for smart devices aligned with the goals discussed in the objectives above. These contributions are grouped into four related areas, which correspond to the four central parts of this document: (i) Foundations and Tools, (ii) Static-based Analysis, (iii) Dynamic-based Analysis, and (iv) Cloud-based Analysis.

Foundations and tools. Part I presents the current state of malware analysis and provides a framework for investigating different analysis and detection strategies for untrusted or malicious code. The following two contributions are presented:

A comprehensive analysis of the evolution of untrusted code for smart devices and current detection strategies. Chapter 2 provides a characterization of current malware’s main features together with an in-depth analysis of both malware and grayware evolution. We identify exhibited behaviors, pursued goals, infection and distribution strategies, etc. and provide numerous examples through case studies of the most relevant specimens. This chapter also includes a careful review of current detection techniques and presents a taxonomy that provides a comprehensive analysis of their strengths and weaknesses.
A research lab for smart malware analysis and detection. The comprehensive study described in Chapter 2 suggests the need for a versatile and multipurpose research laboratory for smart malware analysis and detection. Chapter 3 presents a new generation lab and describes the three building blocks of its architecture: static-, dynamic-, and cloud-based analysis systems. Each system is built on a number of open source tools that facilitate the extraction of security features from apps: static features from the apps’ components and also dynamic characteristics obtained from their execution. The lab incorporates both physical and virtual devices. These devices are instrumented with cutting-edge tools for monitoring a great number of features, ranging from (i) hardware-based signals, such as battery consumption, to (ii) kernel-based features, such as system calls. The lab also includes a dataset composed of a sizable number of apps crawled both from legitimate online markets and malicious public and private repositories. This new generation lab is shown to be paramount for the evaluation of all contributions presented in this Thesis, and extremely useful for automating malware analysis for smart detection.

Static-based Analysis. Part II exploits the use of static features to assist the security analyst in the large scale analysis of malware families:

A text mining approach for analyzing and classifying malware families. Chapter 4 analyzes several statistical and semantic features to facilitate the identification of malicious code components and their similarity to other apps. This chapter shows how static analysis can be used to classify malware with a technique named Dendroid. Dendroid is a system based on text mining and information retrieval techniques used for automating parts of the malware analysis process. This approach is motivated by a statistical analysis of the code structures found in a dataset of Android OS malware families, which reveals some parallelisms with classical problems in information retrieval domains. In this regard, we adapt the standard Vector Space Model and reformulate the modeling process followed in text mining applications. This enables us to measure similarity between malware samples, which is then used to automatically classify them into families. We also investigate the application of hierarchical clustering over the feature vectors obtained for each malware family. The resulting dendrograms resemble the so-called phylogenetic trees for biological species, allowing us to conjecture about evolutionary relationships among families. In fact, this contribution reveals that current malware families rely heavily on a reuse-oriented development methodology, which favors static-based detection strategies.
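
The vector-space idea described above can be sketched in a few lines of Python. This is a generic TF-IDF-style weighting with cosine similarity and nearest-family classification, offered as an assumption about how such a model might look; it is not the exact formulation used by Dendroid. The family names, the tokens standing in for code structures, and the smoothing in the inverse-frequency term are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: each sample is a list of code-structure tokens.
# Family names and tokens are invented; they do not come from the thesis dataset.
FAMILIES = {
    "FamilyA": [["if", "call", "loop", "call"], ["if", "call", "call", "goto"]],
    "FamilyB": [["loop", "loop", "return"], ["loop", "return", "return", "if"]],
}


def family_vectors(families):
    """Build one weighted term vector per family, plus the inverse-frequency table."""
    # Term frequency: how often each code structure occurs within a family.
    tf = {name: Counter(tok for sample in samples for tok in sample)
          for name, samples in families.items()}
    n = len(tf)
    # Family frequency: in how many families each code structure occurs.
    df = Counter(tok for counts in tf.values() for tok in counts)
    # Smoothed inverse frequency (an assumed variant): rarer structures weigh more.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = {name: {tok: c * idf[tok] for tok, c in counts.items()}
               for name, counts in tf.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(sample, vectors, idf):
    """Assign a sample to the family whose vector is most similar to it."""
    query = {tok: c * idf.get(tok, 0.0) for tok, c in Counter(sample).items()}
    return max(vectors, key=lambda name: cosine(query, vectors[name]))


if __name__ == "__main__":
    vectors, idf = family_vectors(FAMILIES)
    print(classify(["call", "if", "call"], vectors, idf))
```

The same pairwise similarities could also feed a hierarchical clustering routine to obtain a dendrogram over families, which is omitted here to keep the sketch short.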

Dynamic-based Analysis. Part III compiles efforts based on the dynamic execution of untrusted code and the analysis of its resulting behavior. The following fundamental contribution is tackled:

Differential fault analysis of obfuscated malware behavior. Obfuscated malware provides attackers with the ability to evade static analysis. Chapter 5 introduces a dynamic-based detection technique called Alterdroid for identifying obfuscated malware in large-scale analysis scenarios. Alterdroid provides security analysts with a framework capable of automating the identification of obfuscated components distributed as parts of an app. The key idea in Alterdroid consists of analyzing the behavioral differences between the original app and a number of automatically generated versions of it in which a number of modifications (faults) have been carefully injected. Observable differences in terms of activities that appear or vanish in the modified app are recorded, and this signature is finally analyzed through a pattern-matching process driven by rules that relate different types of hidden functionalities with patterns found in the differential signature.
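
A minimal Python sketch of the differential idea follows: compare the activities observed for the original app with those observed for a fault-injected version, record what vanished and what appeared, and match that signature against rules. The activity labels, the rule format, and the rule texts are hypothetical and only illustrate the mechanism; they are not Alterdroid's actual data model or rule language.

```python
# Hypothetical activity traces, modeled as sets of observed activity labels.
ORIGINAL_TRACE = {"file.read", "net.connect", "sms.send"}
# Trace of the same app after one component was perturbed (a fault was injected).
FAULTY_TRACE = {"file.read", "app.error"}

# Hypothetical rules: (signature part to inspect, activity, interpretation).
RULES = [
    ("vanished", "sms.send", "perturbed component may hide SMS-sending code"),
    ("vanished", "net.connect", "perturbed component may hide networking code"),
    ("appeared", "app.error", "app depends on the component's exact contents"),
]


def differential_signature(original, faulty):
    """Return the activities that vanish and appear once the fault is injected."""
    return {
        "vanished": sorted(original - faulty),
        "appeared": sorted(faulty - original),
    }


def match_rules(signature, rules):
    """Return the interpretation of every rule whose activity is in the signature."""
    return [label for kind, activity, label in rules if activity in signature[kind]]


if __name__ == "__main__":
    signature = differential_signature(ORIGINAL_TRACE, FAULTY_TRACE)
    print(signature)
    for finding in match_rules(signature, RULES):
        print(finding)
```

Using plain set differences keeps the signature easy to inspect; a fuller treatment would also have to account for run-to-run variation in the traces, which this sketch ignores.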

Cloud-based Analysis. Part IV contains two contributions related to the use of the cloud to offload detection strategies from devices. The first contribution explores the question of offloading—or not—general anomaly-based detection strategies. The second contribution stands over the conclusions extracted from the first one, and approaches the detection of targeted malware using a cloud-based strategy. We next summarize each one:

Power-aware anomaly detection in smartphones. Many recent works simply assume that on-platform detection is prohibitive and suggest using offloaded (i.e., cloud-based) engines. Chapter 6 studies different security tasks involved in the detection of malware in built-in detection systems. Specifically, it focuses on machine learning based anomaly detection systems, as they are widely used to build both static and dynamic detection techniques. This chapter studies the power-consumption trade-offs among different strategies for off-loading, or not, those security tasks. It also shows that outsourced detection strategies are clearly the best option in terms of power consumption when compared to on-platform detection. This contribution also points out noticeable differences among different machine learning algorithms, and provides separate consumption models for functional blocks (data preprocessing, training, test, and communications) that can be used to obtain power consumption estimates and compare detectors.
A stochastic behavioral-triggering model for targeted malware detection. Targeted malware challenges current dynamic-based detection strategies as analysts must reproduce very specific activation conditions to trigger malicious payloads. Furthermore, the consumption model presented in Chapter 6 shows that the use of detection techniques built in the device is unaffordable. Chapter 7 proposes a cloud-based system, called Targetdroid, to facilitate the detection of this type of malware. The contribution presented in this chapter relies on automatically learned stochastic models of usage and context events derived from real users. This chapter reveals several interesting particularities of app usage patterns that allow for an efficient generation of testing patterns. This contribution shows that testing patterns automatically is feasible, especially when this is done in conjunction with a cloud infrastructure.
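
To make the notion of a learned stochastic model of usage and context events more concrete, the Python sketch below estimates transition probabilities from event traces and samples new event sequences that could drive automated testing. Modeling the traces as a first-order Markov chain is an assumption made here for illustration; the text above does not state which model Targetdroid uses, and the event names and traces are invented.

```python
import random
from collections import defaultdict

# Hypothetical user traces: ordered usage and context events (names invented).
TRACES = [
    ["unlock", "open_app", "at_home", "lock"],
    ["unlock", "open_app", "at_work", "lock"],
    ["unlock", "at_home", "open_app", "lock"],
]


def learn_transitions(traces):
    """Estimate first-order transition probabilities from observed event traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            counts[current][following] += 1
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {event: c / total for event, c in followers.items()}
    return model


def generate(model, start, max_length, rng):
    """Sample an event sequence, stopping at max_length or at a terminal event."""
    sequence = [start]
    while len(sequence) < max_length and sequence[-1] in model:
        events, probabilities = zip(*model[sequence[-1]].items())
        sequence.append(rng.choices(events, weights=probabilities)[0])
    return sequence


if __name__ == "__main__":
    model = learn_transitions(TRACES)
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(generate(model, "unlock", 6, rng))
```

Sampling from the learned model yields event sequences that resemble what users actually do, so fewer test runs are spent on inputs that no real user would produce.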

Finally, Part V presents the main conclusions, analyzes the contributions of this Thesis and the published results, and discusses open research problems and future work. This part also comprises the references and appendices.

Results and Publications

All contributions resulting from this Thesis have been submitted for publication to top ranked peer reviewed journals and international conferences in the Computer Science area. Furthermore, software produced as a result of this Thesis has been submitted for copyright protection and made available for fair use to the research community.

We list all publications that arise from this Thesis organized by contribution:

P1: “Evolution, Detection and Analysis of Malware for Smart Devices”

Authors: Guillermo Suarez-Tangil, Juan E. Tapiador, Pedro Peris-Lopez, and Arturo Ribagorda.
In: IEEE Communications Surveys and Tutorials, In press (2013).
I.F. (2012): 4.81.
Position in category: 2/132 (Q1) in Computer Science.

P2: “Dendroid: A Text Mining Approach to Analyzing and Classifying Code Structures in Android Malware Families”.

Authors: Guillermo Suarez-Tangil, Juan E. Tapiador, Pedro Peris-Lopez, and Jorge Blasco.
In: Expert Systems with Applications (Elsevier), Vol. 41:4, pp. 1104-1117 (2014).
I.F. (2012): 1.85.
Position in category: 56/243 (Q1) in Engineering.

P3: “Thwarting Obfuscated Malware via Differential Fault Analysis”.

Authors: Guillermo Suarez-Tangil, Flavio Lombardi, Juan E. Tapiador, and Roberto Di Pietro.
In: IEEE Computer, In press (2014).
I.F. (2012): 1.68.
Position in category: 9/50 (Q1) in Computer Science.

P4: “Detecting Targeted Smartphone Malware with Behavior-Triggering Stochastic Models”.

Authors: Guillermo Suarez-Tangil, Mauro Conti, Juan E. Tapiador, and Pedro Peris-Lopez.
To: European Symposium On Research In Computer Security (ESORICS), December 2014.
Rank (2013): CORE A in Computer Software.

P5: “Alterdroid: Differential Fault Analysis of Obfuscated Malware Behavior”

Authors: Guillermo Suarez-Tangil, Juan E. Tapiador, Flavio Lombardi, and Roberto Di Pietro.
Paper submitted.

P6: “Power-aware Anomaly Detection in Smartphones: An Analysis of On-Platform versus Externalized Operation”.

Authors: Guillermo Suarez-Tangil, Juan E. Tapiador, Pedro Peris-Lopez, and Sergio Pastrana.
In: Pervasive and Mobile Computing, In press (2014)
I.F. (2012): 1.63 (Q1)

Summary of the results and related publications:

Future Work

Malware in smart devices still poses many challenges, and a number of important issues need to be further studied and addressed with novel solutions:

Stegomalware: In the case of smart malware, one commonly observed technique consists of hiding modules containing malicious functionality in places that static analysis tools overlook (e.g., within data objects). More sophisticated hiding techniques, particularly in code, are starting to materialize. These techniques and trends create an additional obstacle to malware analysts, who see their task further complicated and have to ultimately rely on carefully controlled dynamic analysis techniques, such as Alterdroid, to detect the presence of potentially dangerous pieces of code. We believe that smart malware could be using advanced techniques, such as steganography, to conceal its modules within other components of the code. This is especially critical when these components are hidden within distinguishable components (see Alterdroid, Chapter 5).
- Paper submitted: Stegomalware: Hindering Malware Detection via Steganography in Smart Devices.
Cooperative security: In the near future it is very likely that many users will own a network of smart devices, including smartphones, smart TVs and other home appliances, and wearable computing platforms. Such networks could be leveraged to implement cooperative security functions, as a complement to cloud-based and on-platform monitoring and analysis mechanisms. Ideally, several connected devices could cooperate to improve security in a number of ways. For example, resource-intensive tasks can be delegated to devices with a permanent power source to preserve the battery of mobile platforms. Similarly, mutually monitoring schemes could be interesting, where each device monitors the behavior of others to detect compromise.
Trusted Software: In the case of current smartphones and tablets, trust in the non-malicious nature of an app is based on two factors: (i) the implicit assumption that the market operator has conducted some security review before making the app available for download; and (ii) the identity of the developer, given by the signature attached to the app, which also provides some evidence of the app’s integrity. The first point is not fully reliable, as operators cannot afford to carry out an exhaustive analysis over every submitted app; and, even if they could, there is still some non-negligible probability of sophisticated malware evading detection. As for the identity of the developer and the app’s integrity, evidence suggests that most users do not pay much attention to them, or simply ignore them when downloading apps from alternative markets. We believe that further efforts to improve trust in software are required. This will be increasingly necessary in the near future, as the number of developers (and, hence, apps) will likely grow very significantly. Reputation systems [Viriyasitavat and Martin, 2012; Zacharia et al., 2000] adapted to this context might offer some added value, in particular by exploiting interactions in large user communities such as those provided by online social networks [Govindan and Mohapatra, 2012]. But other mechanisms for building trust could also apply, such as remote attestation protocols [Nauman et al., 2010; Saroiu and Wolman, 2010; Viriyasitavat and Martin, 2012] or any other schemes to ensure the authenticity and integrity of software.
Malware in Other Smart Devices: The experience gained from current smartphones suggests that malware will also hit other smart devices as soon as they appear. Evidence in other pervasive technologies already exists. For example, nowadays Radio Frequency Identification (RFID) systems are used in a wide range of applications, such as transport tickets, access control systems, e-passports, e-health applications, etc. The benefits of adopting RFID technology for identification purposes are clear, but its associated security risks need to be addressed. One of them, often underestimated, is malware. The use of Internet-enabled mobile devices as RFID readers makes this sort of attack potentially more harmful. Most previous works have focused on securing the communication link between the tag and the (mobile) reader. There are, however, some preliminary works [Rieback et al., 2006; Yan et al., 2009] on RFID malware, but further studies and solutions are required. Similarly, Implantable Medical Devices (IMDs) and other medical devices will likely be an attractive target for attackers due to the economic value of the information they can provide [Burleson et al., 2012; Clark and Fu, 2012; Clark et al., 2013].
Forensics-based analysis for smart device protection: Sometimes malicious programs uninstall themselves after achieving their goals. However, the evidence they leave behind could be analyzed and used as an input for detecting future propagation using the same infection vector. Identifying such traces is a great challenge, particularly due to the availability of anti-forensic tools for devices such as smartphones [Distefano et al., 2010]. In this regard, two different approaches might be worth exploring. On the one hand, deleting evidence or attempting to neutralize any source of evidence usually produces fresh new evidence. On the other hand, new paradigms, such as virtualized device replicas in the cloud, allow the creation of novel forensic approaches on the cloud based on virtual introspection.
Offloaded security: Applications are increasingly requiring the user to authorize the transfer of personal information to the cloud as part of the normal use of the application. For instance, WhatsApp sends the user’s address book to establish friendship connections [WhatsApp, 2014]. However, even if the user authorizes such a transfer, it does not mean that the information will not be used for purposes other than those conveyed to the user, such as market research. In other cases, users are only informed that some personal information will be sent, but the particulars about what specific items or how it will be used are not given. Identifying misuse of personal information, both on-platform and in the cloud, is a challenging problem that is typically tackled by legal enforcement mechanisms, but technical approaches should be explored. For instance, in the same way that Google App Engine [Google, 2014a] is used to deploy in-the-cloud applications (monitored by Google), back-end services for smartphones and other smart devices could be moved to a cloud controlled and monitored by a trusted third party. This could make it feasible to monitor behavior and enforce security policies in the cloud end of the service, thus complementing other security mechanisms applied in the device. Similar privacy-related problems arise in cloud-based monitoring schemes, primarily in those that maintain a virtualized replica of the device to carry out monitoring tasks that are unaffordable to perform directly on the device. Privacy-preserving monitoring systems for this scenario are required, but also more lightweight monitoring and detection mechanisms that can run on platform with an appropriate balance between efficacy and power consumption.

Research Visits

The research visits performed during the Thesis are:

Università degli Studi di Roma Tre: I visited Dr. Roberto Di Pietro between September and December 2012. Resulting from this visit, we have published “Thwarting Obfuscated Malware via Differential Fault Analysis” in IEEE Computer 2014, and submitted “Alterdroid: Differential Fault Analysis of Obfuscated Malware Behavior” to an IEEE Transactions journal (2014). We are currently working on other proposals.
Università degli Studi di Padova: I visited Dr. Mauro Conti between September and October 2013. As a result of this visit, we have published the following contribution “Detecting Targeted Smartphone Malware with Behavior-Triggering Stochastic Models” in ESORICS 2014. We are currently working on extending our proposal.