HyperLab

HyperLab | 基于反洗钱平台对Deribit盗币事件的分析和追踪

hyperlab@newsletter.paragraph.com (HyperLab) — Wed, 02 Nov 2022 13:45:53 GMT

事件分析：

@DeribitExchange于11月2日发布了Deribit热钱包被盗的消息

此次事件涉事地址包括：

0x58F56615180A8eeA4c462235D9e215F72484B4A3 （Addr1）

0x8d08aAd4b2BAc2bB761aC4781CF62468C9ec47b4（Addr2）

0xb0606F433496BF66338b8AD6b6d51fC4D84A44CD（Addr3）

下文会分别以Addr1，Addr2和Addr3代替。

Hyperlab安全分析员通过HyperLab AML平台分析得出：11月1日 (11:57:11 PM +UTC)，Addr1地址先后分别转出6947 ETH， 0.65 ETH和20 ETH到Addr2地址。Addr2地址随后通过UniSwap进行了多次换币，换币总金额价值2143.94ETH，涉及的换币币种有USDC，WBTC， DAI和USDT。之后又分两笔金额分别为9080.18ETH和31.40ETH的交易转入地址Addr3。

截止发文时，被盗资产仍存储在Addr3地址中。事件发生第一时间，HyperLab对涉事地址迅速进行了标记。HyperLab将对相关地址动态进行持续追踪。

图1. Hyperlab AML平台对涉事地址0xb0606F433496BF66338b8AD6b6d51fC4D84A44CD的分析和标记

图2. Hyperlab AML平台对涉事地址0x8d08aAd4b2BAc2bB761aC4781CF62468C9ec47b4的分析和标记

图3. Hyperlab AML平台对本次盗币事件交易的追踪

HyperLab AML平台：

Hyperlab AML平台分析了数百个全球交易所、混币器、洗钱系统、赌博服务系统和已知的犯罪分子的钱包地址，利用机器学习和深度学习算法快递收集加密货币交易中产生的各种变量，最终通过算法对可疑地址和交易进行关联，从而对该交易进行风险评估。

为了方便专业用户的使用，Hyperlab 创建了一个强大的应用程序接口 (API)，该接口可以快速对所有的加密货币交易进行风险评估。用户在通过API获得相应的评估结果之后，可自行决定是否继续深入调查追踪该笔交易是否违反了当地的反洗钱政策以及相应法规。如果用户需要，该API 还可以自动生成一个更深入，详细的分析报告以满足相应供监管机构的需求。

此外，针对没有相关知识背景的普通用户，Hyperlab AML平台还有另外一个交互式界面，支持此类用户对某些钱包地址进行深入调查分析，并为其提供评估结果，从而保护普通用户的财产及人身安全。

想要获得更多的信息？您可以访问我们的官方网站https://www.hyperlab.org/ 从而更加深入详细的了解我们的所有服务。

跨链 DEX 聚合器 Transit Swap 攻击分析

hyperlab@newsletter.paragraph.com (HyperLab) — Mon, 03 Oct 2022 08:02:49 GMT

10月2日，跨链 DEX 聚合器 Transit Swap 遭受攻击，导致大量用户的资金从钱包中被取出，预计损失超 2100 万美元。Transit Finance称，各方安全公司和项目组仍在继续追踪黑客事件，并通过邮件、链上等方式与黑客进行沟通。项目组正在抓紧收集被盗用户的具体数据，制定具体的返还方案。同时团队将继续追回黑客被盗资产的剩余资产，并将其归还给丢失的用户。截止发稿前，黑客已经归还约70%的资产。

黑客地址：0x75F2abA6a44580D7be2C4e42885D4a1917bFFD46

Hyperlab 安全实验室根据链上交易进行分析，此次攻击发生的要原因在于 Transit Swap 协议在进行代币兑换时并未对用户传入的数据进行严格检查，导致了任意外部调用的问题。

具体来说，用户在用Transit Swap进行兑换时，会调用一个代理合约(0x8785bb8deae13783b24d7afe250d42ea7d7e9d72)。这个代理合约会根据代币的种类选择路径。紧接着路由桥合约（0x0B47275E0Fe7D5054373778960c99FD24F59ff52）会调用权限管理合约（0xed1afc8c4604958c2f38a3408fa63b32e737c428）的claimtokens进行转账。而 claimTokens 函数是通过调用指定代币合约的 transferFrom 函数进行转账的。其接收的参数都由上层路由桥合约传入，本身没有对这些参数进行任何限制，只检查了调用者必须为路由代理合约或路由桥合约。

攻击者利用路由代理合约、路由桥合约与权限管理合约未对传入的数据进行检查的缺陷，通过路由代理合约传入构造后的数据调用路由桥合约的 callBytes 函数。callBytes 函数解析出攻击者指定的兑换合约与兑换数据。此时兑换合约被指定为权限管理合约地址，兑换数据被指定为调用 claimTokens 函数将指定用户的代币转入攻击者指定的地址中。由此攻击者实现了窃取所有对权限管理合约进行授权的用户的代币。

Hyperlab 安全实验室提醒，在跨合约的函数调用时，合约开发者需要对用户传入的数据进行严格检查，并对下层对上层给的返回值进行正确校验。对Transit Swap用户而言，需停止使用Transit Swap，并提高自身对区块链钱包的安全意识，尤其是对DEX 的router的无限授权权限保持谨慎。

Cross-Language Android Permission Specification

hyperlab@newsletter.paragraph.com (HyperLab) — Sat, 23 Jul 2022 03:56:00 GMT

ABSTRACT

TheAndroidsystemmanagesaccesstosensitiveAPIsbypermission enforcement. An application (app) must declare proper permissions before invoking specific Android APIs. However, there is no oficial documentation providing the complete list of permission-protected APIs and the corresponding permissions to date. Researchers have spent significant efforts extracting such API protection mapping fromtheAndroidAPIframework,whichleveragesstaticcodeanaly-sistodetermineifspecificpermissionsarerequiredbeforeaccessing an API. Nevertheless, none of them has attempted to analyze the protection mapping in the native library (i.e., code written in C and C++), an essential component of the Android framework that handles communication with the lower-level hardware, such as cameras and sensors. While the protection mapping can be uti-lized to detect various security vulnerabilities in Android apps, such as permission over-privilege, imprecise mapping will lead to false results in detecting such security vulnerabilities. To fill this gap, we thereby propose to construct the protection mapping in-volved in the native libraries of the Android framework to present a complete and accurate specification of Android API protection. We develop a prototype system, named NatiDroid, to facilitate the cross-language static analysis and compare its performance with two state-of-the-practice tools, termed Axplorer [13] and Arcade [9]. We evaluate NatiDroid on more than 11,000 Android apps, including system apps from custom Android ROMs and third-party apps from the Google Play. Our NatiDroid can identify up to 464 new API-permission mappings, in contrast to the worst-case results derived from both Axplorer and Arcade, where approx-imately 71% apps have at least one false positive in permission over-privilege. We have disclosed all the potential vulnerabilities detected to the stakeholders.

ACM Reference Format:

Anonymous Author(s). 2022. Cross-Language Android Permission Specifi-cation. In Proceedings of The 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ES-EC/FSE 2022). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/ nnnnnnn.nnnnnnn

![ Figure 1: Android software stack

](https://images.mirror-media.xyz/publication-images/x5PBPecfYRUNlTXoKehjQ.png?height=556&width=609)

1 INTRODUCTION

Android protects access to restricted data (e.g., the device identifier) and actions (e.g., making phone calls) through permission enforce-ment [12]. Such an access control model can protect users against snooping and protect the stability and security of the operating system [31]. When an Android app attempts to access the restricted resources, a security check is triggered to inspect whether proper permissions are granted. Lack of permission request will prevent access to the resource and further cease the corresponding function-ality or even crash the app. Therefore, it is essential for developers to know the permissions required of the invoked API. Unnecessary required permissions can pose three threats: (i) Too many required permissions may confuse users. Users suspect that the app hasunexpected behaviors, which leads to users uninstalling or unwilling to install the app; (ii) The permissions required by an app is an important feature in detecting Android malware. Unnecessary permissions will fool the detector and cause a false alarm; (iii) The app will incur security risks with unnecessary permissions. Once the app contains vulnerabilities that can be injected with tampered code, it is easy for an attacker to thwart user privacy or invoke sensitive APIs. Moreover, requesting unnecessary permissions may expand the attack surface and expose the Android operating system to a host of attacks, especially privilege escalation attacks [17].Therefore, to safeguard users’ privacy and protect the Android ecosystem, Android app developers are suggested to follow the principle of least privilege, i.e., requesting a minimum set of permissions required to fulfill the apps’ functionality. Unfortunately, Android does not provide official documentation for the permission specifications (i.e., a mapping between APIs and the required permissions), making it difficult for app developers to follow the least privilege rule, and further lead to security vulnerabilities such as permission over-privilege [17]. To address this problem, researchers have been working on developing methods that generate an accurate list, called a protection map, that maps Android APIs to the required permissions. The previous works that provide such protection maps include Stowaway [17], PScout [12], Axplorer [13], Arcade [9], PSGen [39], and most recently, Dynamo [14]. Stowaway empirically determines the permissions required in Android APIs using feedback-directed testing. PScout and Axplorer leverage control-flow reachability analysis on the source code of the Android framework to generate the mapping between APIs and permissions. Arcade proposes a path-sensitive method based on graph abstraction techniques to generate a more precise mapping. PSGen statically generatesthe protection map for Android NDK. Dynamo extracts the protection mapping through dynamic test. Dynamic testing methods (e.g., Stowaway and Dynamo) can accurately map the required permissions to API invocations that they have tested; however, such dynamic approaches suffer from an intrinsic shortcoming of low coverage. The existing static analysis based approaches (e.g., PScout, Axplorer, and Arcade) have better coverage but may lead to imprecise results because of improper modeling of the complicated Android communication mechanisms. PSGen extracts protection mapping for Android NDK (i.e., both protected APIs and permission checks are in the Native code), which has a different scope to our approach (i.e., the protected APIs are in the Java framework, while the permission checks are in the Native code).

Specifically, existing works only analyzed the Java API framework in the Android API Framework, but overlooked the C/C++ Native Library that consists of core Android system components and services (e.g., Camera service, Sensor service). For example, the public method openCamera() in CameraManager.java class implements its permission check (“android.permission.CAMERA”) in the native library CameraService.cpp. Missing native library analysis will mistakenly conclude that the API openCamera() does not require any permissions (one example is detailed in Section 2). While the API-permission protection mapping contributes to identifying security vulnerabilities in Android apps, such as permission over-privilege [17] (i.e., an app requests additional permissions that are not required), the imprecise mapping will lead to false results on detecting such vulnerabilities. Taking the aforementioned case as an example, an app invokes the API openCamera() will need to request the corresponding permission android.permission.CAMERA; however, existing works that do not analyze the native libraries will identify it as a permission overprivilege case, and henceforce, a false positive. Static analysis is widely used in code analysis due to its fast speed and high code coverage, especially in the field of Android security analysis [11, 21, 22, 24, 26, 27, 32, 38]. There are also many works on vulnerability identification of Android operation systems [10, 15, 16, 23, 25, 28–30, 33, 35]. In this research, to address the shortcomings of the existing works, we leverage the cross-language analysis on the overall Android system, including both the Java API Framework and the C/C++ Native Libraries. To this end, we analyze the cross-language communication mechanisms on five Android Open Source Projects (AOSP) and summarize two communication models to facilitate the cross-language analysis. We develop a prototype system, NatiDroid, and generate Native-triggered (i.e., an Android API whose permission check is implemented in the native library) API-permission mappings in AOSP versions 7.0, 7.1, 8.0, 8.1, and 10.0, which were chosen to benchmark against prior works [9, 13] (see detailed discussion in Section 5). In addition to the mappings generated by previous works, Axplorer and Arcade (2,115 and 1,585, respectively, in AOSP 7.0), NatiDroid can successfully discover 449 mappings that are not covered previously.

Note that while most Android APIs are Java methods, fewer of them are C/C++ methods. Nevertheless, these native methods play indispensable roles in the Android system, such as interacting with the hardware layer. We further use the new mappings to detect permission over-privilege vulnerabilities on a large dataset containing more than 11,000 Android apps, including system apps from custom Android ROMs and third-party apps from the Google Play. We identify the worst-case scenario, where approximately 71% apps with permission over-privilege detected by Axplorer and Arcade are false positives. In summary, we make the following contributions:

• We design and implement a prototype system, NatiDroid, to

facilitate cross-language control-flow analysis on the Android

framework. To the best of our knowledge, this is the first work

to enable cross-language analysis on the Android framework.

By incorporating NatiDroid with existing Java-side permission

mappings (e.g., Axplorer or Arcade), we obtain a complete per

mission mapping that covers the entire Android system. We make

our system and results publicly available to benefit researchers

and practitioners.1

• We apply NatiDroid to extract the permission-API protection mappings from the native libraries on four AOSP versions (7.0, 7.1, 8.0, 8.1, and 10.0). We show that 12 permissions, including 8 signature and 2 dangerous permissions, are determined to be enforced in native libraries, which are not covered by two state of-the-practice benchmarks, Axplorer and Arcade.

• We analyze Android apps for permission over-privilege. Our results show that NatiDroid is effective in identifying vulnerable Android apps. We have identified approximately 71% false positives in terms of the number of the apps with at least one permission over-privileged.

We hope that the proposed system, NatiDroid in this paper could bridge the gap between Java- and Native-sides analysis (see Figure 1), rendering the static analysis of the overall Android framework to be complete and accurate.

2 BACKGROUND AND MOTIVATION

This section provides background information on how Android OS operates and explains the limitations of the existing static API protection mapping generation techniques that motivate our work. Android framework. Android framework consists of the Java API Framework layer and C/C++ Native Library layer (i.e., the second and the third layers from the top in Figure 1). The Java API Framework layer offers Application Programming Interfaces (APIs) written in the Java language for Android app developers to access Android features and functionalities. The Java framework access the device hardware capabilities, such as the camera and sensors, via the C/C++ Native Library layer. When a Java framework API (e.g., the Camera Manager in Figure 1) invokes a call to access device hardware, the Android system loads corresponding library module (e.g., the Camera Service) for that hardware component.

Android permission model. When an app needs to use any of the protected features of an Android device (e.g., accessing the camera, sending an SMS, making a phone call), it must obtain the appropriate permission(s) from the user. When an Android API is called, the Android framework checks whether it is protected by any permission. Most of such permission checks are defined in the Java API Framework layer in the Android system, while there are yet a number of them defined in the C/C++ Native Library layer.

Existing works [9, 12, 13] leverage static analysis on the Java API Framework layer of the Android framework to extract the mapping between APIs and corresponding permission checks. Ignoring the invocation of native libraries miss the permission checks in the native libraries, leading to incomplete mapping results.

Figure 2: Motivation example derived from a real-world Android app vStudio.Android.Camera360. The code is simplified for better illustration.

Motivating example. We further elaborate on our motivation with a real-world example illustrated in Figure 2. The code from lines 1 to 10 is derived from a popular photography app vStudio.Android.Camera360 [8] on Google Play. The app initialises a CameraManager instance (line 4), and opens the camera instance by invoking openCamera() method (line 8). The invocation chain then traverses along the call path through an Android SDK class CameraManager.java (lines 11 to 18) and a native library CameraService.cpp (lines 19 to 32), and finally triggers a permission check in the native library (line 28). Note that the cameraService.connectDevice() (line 18) communicates with the CameraService::connectDevice() (line 20) in a cross-language way (marked as purple). The security check examines if the method is called by its own process (i.e., cameraservice, hence, no permission is required) or the corresponding permission android.permission.CAMERA is granted. If neither it is called from its own process, nor the android.permission.CAMERA permission is granted, a permission denied error is returned (line 30), which will further prevent openCamera() to be executed. This example implies a protection mapping from the Android API, CameraManager.openCamera(), to itspermission protection check, {android.permission.CAMERA || callingPid == getpid()}.2

Unfortunately, as existing works only analyzed the Java source code in the Android framework, they miss the permission checks implemented in the native libraries. For instance, the mapping of the API CameraManager.openCamera() to the permission android.permission.CAMERA, as shown in the example, does not exist in the state-of-the-practice works, such as PScout [12], Axplorer [13], or Arcade [9]. The incompleteness of the mapping results further introduces false results in detecting security vulnerabilities, such as permission over-privilege [17] (detailed in Section 4.3)

3 APPROACH

We propose and implement a prototype system, NatiDroid, to address the cross-language protection mapping problem that has long been overlooked in previous works. Figure 3 illustrates the overall design of NatiDroid. As depicted, NatiDroid contains three modules. The Pre-processing module prepares the intermediate artifacts for analyzing the Android framework and native libraries, such as intermediate .jar files (for Java-side analysis) and Clang compile commands (for Native-side analysis). The Entry-points identification module summarizes two cross-language communication models used in the Android framework, and identifies the entry-points for both Java- and Native-sides analysis. The Cross-language Control Flow Graph (CFG) analysis module constructs the cross-language CFG and extracts the permission mapping.

We propose a complete solution for extracting Native-triggered permission mapping from the Android system. We leverage Soot [1] and Clang [2] static analysis frameworks, although our solution is also applicable to other static analysis frameworks. Soot is a popular Java optimization framework for analyzing and visualizing Java and Android apps, which has been widely used in various projects for static analysis [12, 18, 19]. Clang is a lightweight compiler for C language family. We use Clang to transform C/C++ code to the Abstract syntax tree (AST) [3]. Additional code for implementing NatiDroid consists of approximately 7kLOC. We detail the design and implementation of each module in the following subsections.

3.1 Pre-processing

Due to the complexity and cross-language nature of the Android framework, there is no off-the-shelf tool for static analysis of the Android framework (i.e., Java API Framework and Native Library) as a whole. NatiDroid leverages the divide-and-conquer strategy to facilitate the Java- and Native-sides analysis. However, there are still non-trivial tasks to prepare the AOSP codebase for the static analysis. Hence, in this module, we prepare the intermediate artifacts from the AOSP codebase, which are required to enable the Java- and Native-sides analysis. Note that the pre-processing module includes most engineering works, which is not considered our technical contribution. However, it is essential to facilitate the proposed cross-language analysis.

![ Figure 3: An overview of NatiDroid system

](https://images.mirror-media.xyz/publication-images/4MIJk1yb0zMz2w-UXK5ec.png?height=365&width=1087)

Java-side analysis preparation. NatiDroid’s Java-side analysis takes compiled .jar file as input. However, to maintain the stability of the Android system, some non-SDK class and method bodies are hidden using the @hidden Javadoc tag (e.g., non-SDK Android APIs that may be changed in the future versions without noticing the app developer) during the building of android.jar from source code. The hidden classes and methods only expose the method name, parameters, return values, and minimum set of statements required to handle the invocation, which is not sufficient for constructing a complete CFG. We therefore retain the intermediate output during the compilation, i.e., the intermediate .jar files that have not been combined as android.jar. These intermediate .jar files, such as services.com.intermediates.jar, have the complete class and method information sufficient for facilitating static analysis.

Native-side analysis preparation. Before we build the crosslanguage CFG (cf. Section 3.3), we leverage Clang to transform C/C++ source code to AST. A complete set of Clang compile command is required to enable the static analysis, however, is not provided in Android documentation. Android uses the ninja to build system [6]. During the compilation process, the .ninja files containing ninja build commands are generated by the compiler.

However, the commands obtained from .ninja files consist of file operations and a mixture of GNU Compiler Collection (GCC) and Clang commands, which are not compatible with the off-the-shelf Clang-based analyzer. We then develop a system (to 500 LOC) to extract and pre-process the required commands from these files. The functions of the system include merging separated ninja commands and replacing the Clang++ commands with Clang commands (i.e., adding C++ headers in Clang command’s parameters).

3.2 Entry-Points Identification

Recall that the overall idea of generating a protection mapping is to examine whether the invocation of an API will trigger a permission check in the Android framework (i.e., if there is a permission check node in the CFG starting from the API call). Due to the complexity of the Android framework, building a CFG of the overall framework is neither practical nor efficient. As NatiDroid’s goal is to complement the existing mappings, such as Axplorer and Arcade, by adding the protections whose permission checks are located in the native libraries, we only generate sub-CFGs for the Android APIs that involve cross-language communication. The first step to generate sub-CFGs is to identify the entry-points of the graphs (for both Java- and Native-sides). To this end, we first summarize two cross-language communication mechanisms used by Android.

![ Figure 4: AIDL communication model. The Java-side caller invokes remote method from Native-side.

](https://images.mirror-media.xyz/publication-images/TmQrkwSt5j7Io92h32gxu.png?height=218&width=521)

AIDL-based communication model. The Android operating system (OS) is based on Linux but does not inherit the Inter-Process Communication (IPC) mechanism of Linux. In Android, resources between processes are not shared. Android Interface Definition Language (AIDL) is one of the IPC mechanisms that Android adopted to enable communication for remote invocation, providing crossprocess or even cross-language communication. Figure 4 depicts the workflow of AIDL-based client-service model, where the Java framework works as a client requesting service from the native library. AIDL utilizes a pair of Stub/Proxy classes on Java-side to communicate with native libraries. The Proxy is responsible for sending requests to native service and implementing the remote method which invokes the transact() method and communicates with Native-side, while the Stub class, inheriting the Binder class, transforms service Binder and receives the reply from native service using the method onTransact(). The transact() and on Transact() are synchronous, such that a call to transact() does not return until the target has returned from onTransact().

On Native-side, it is unnecessary to generate Stub/Proxy pairs, but directly implements the remote method (using the name as same as the remote method on Java-side) to handle the request from Java-side, so that we can always find the receiver on Transact() and the same name remote method as the AIDL sender. Through using pairs of onTransact() and transact() methods as sender and receiver on both sides, the communication between Java- and Native-sides is established. Therefore, cross-language interaction can be detected via matching the use of AIDL on Java- and Nativesides (e.g., the Stub/Proxy pair, the onTransact() methods, and the implementation of the remote methods and the transact() methods). We can then determine the entry-points for static analysis accordingly.

![ Figure 5: An example of AIDL-based cross-language communication model. The code snippets are simplified for better illustration.

](https://images.mirror-media.xyz/publication-images/Ivh91dZoFAwRzZ-mHpI79.png?height=522&width=485)

An example of AIDL mechanism is shown in Figure 5, where the method getCameraInfo() uses AIDL to implement the communication between Java and C++. A pair of a sender and a receiver on each side of AIDL (i.e., the client and the service) handles the cross-language communication. The caller (line 3) invokes the method getCameraInfo() which is firstly defined as an interface in line 6 and then implemented in line 14 (the detailed implementation is omitted). In the corresponding native library (lines 16 to 24), the receiver onTransact() handles the request and further invokes the method getCameraInfo() (line 23). The get CameraInfo() method then executes the method implementation and sends the execution result back to the Java-side onTransact() method (line 9), which is further passed back to the caller (line 3). Note that the getCameraInfo() in line 6 and the getCameraInfo() in line 18 are the two interface methods that both invoke the transact() method (omitted) to establish the communication between Java- and Native-sides. The NatiDroid will match the pair of such remote methods and recognize them as a pair of Java entry-point and native entry-point. Note that, for the inner-language AIDL communication on both Java- and Native-sides, NatiDroid generates CFG with a similar approach, except for the identification of entry-points.

JNI-based communication model. Java Native Interface (JNI) provides a cross-language interface between Java and Native code written in C/C++). This enables developers to write native methods to handle situations when an app cannot be written entirely in the Java programming language, e.g., when the standard Java class library does not support the platform-specific features or program library, such as communication with the underlying hardware. In the JNI framework, native functions are implemented in .c or .cpp files. When invoking the function, it passes a JNIEnv pointer, a jobject pointer, and any Java arguments declared by the Java method.

![ Figure 6: JNI communication model. Java- and Native-sides communicate with JNI.

](https://images.mirror-media.xyz/publication-images/KNszCFi-zl3ylckTCQkwH.png?height=279&width=508)

Figure 6 shows the JNI-based communication model adopted by Android. Android uses the JNI Dynamic Register to link native methods. Different from the AIDL model, the JNI-based communication starts from a registration process. When Android starts running, the AndroidRuntime class uses the startReg() method to start the registration of JNI methods, which will further invoke all JNI registration functions implemented in the native libraries.

The registration functions will register native methods with the corresponding Java class specified by an array of JNINativeMethod structures that contains the Java method name, Java method signature, and pointers to the native function. After the registration process, all the JNINativeMethod (on Native-side) is registered and linked to the corresponding Java method in the Java Virtual Machine (JVM).

We further explain the JNI-based communication mechanism with an example given in Figure 7, which is derived from android_hardware_Radio.cpp in AOSP 8.0. The method register_android_hardware_Radio() (line 13) is called to register the JNI methods for Radio, with the JNI method information provided in line 16. Specifically, the kRadioModuleClassPathName variable (line 5) declares the containing class name of the Java method, and gModuleMethods (line 7) declares the correspondence between the Java method and the C++ function. The variable gModuleMethods is defined to contain groups of the Java method name (line 8), parameter and returned types (lines 9 to 10), and the pointer of C++ method (line 11). All the information will be dynamically registered in JVM during run-time. Finally, the C++ method involved in the crosslanguage communication is declared in line 12, while the involving Java method can be found in the java file RadioModule.java in package android.hardware.radio (lines 2 to 3).

![ Figure 7: An example of JNI-based cross-language communication. The code snippets are simplified for better illustration.

](https://images.mirror-media.xyz/publication-images/czCJ-K5HtSWBYHp6ItYOq.png?height=338&width=494)

According to the JNI-based communication mechanism, we extract pairs of entry-points from Java- and Native-sides. Specifically, the JNI methods and the corresponding native methods are vaguely identified by a linear sweep searching of keyword RegisterMethodsOrDie, registerNativeMethods and jniRegister NativeMethods over the .cpp files at first. Then we extract the class path name (e.g., the kRadioModuleClassPathName in Figure 7) and the array of JNINativeMethod structures, from which a pair of entry-points can be located and recognized (e.g., the pair of native_setup() and android_hardware_Radio_setup()).

3.3 Cross-Language Protection Mapping Extraction

After the entry-points on both Java- and Native-sides are identified, we further extract the protection mappings from AOSPs. In this section, we first introduce how NatiDroid generates CFG from both sides. Based on the CFG, NatiDroid then traverses the crosslanguage Android API call paths and corresponding security checks (e.g., permission checks) to generate the API-permission protection mappings.

3.3.1 Cross-language CFG Generation. We elaborate the detailed steps involved in generating cross-language CFGs. After obtaining the entry-point pairs from both Java- and Native-sides (as detailed in Section 3.2), NatiDroid first leverages a forward analysis to generate a CFG on the Native-side from each identified native entry-point. If the native-side CFG does not contain any security checkpoint (e.g., permission check, UID check, and PID check), we discard the CFG for computational efficiency. Otherwise, NatiDroid further utilizes a backward analysis to build a Java-side CFG starting from the paired Java-side entry-point to an Android API. If the reached Android API is further invoked by other Android APIs, we extend the CFG until the API is not called by any other APIs. The CFGs generated from both sides are then connected with the communication models identified in Section 3.2. We detail the mechanisms unique to Android that require additional work to handle as follows.

Handling the service identifier. The aforementioned AIDL is of ten used to invoke remote methods in service. Before the invocation, the service is usually pointed by passing a string to the getService or checkService method, for example, the string “media.player” in line 19 of Figure 8. When building the call graph, we need to handle such remote invocation and identify which class is the identifier string actually pointed to. These services are registered to the system through an addService method (either on Java-side or Native-side). Therefore, we can automatically collect the corre spondences between these identifiers and service classes from it. First, we scan the Java and C++ files looking for addService meth ods. Then the program confirms whether the method is ServiceManager.addService or defaultServiceManager->addService, separately. Once confirmed, the program extracts a pair of service class and the corresponding identifier from the parameters; for example, the string identifier “media.player” in Figure 8 will be paired with its service class MediaPlayerService.

Handling Android strong pointer. Although the strong pointer defines the type variable, the type is not necessary to be restricted. Therefore, we need additional efforts to get what type the strong pointer actually points to from the context. According to the strong pointer mechanism, when we find that a member function is called, if the object is declared as a strong pointer, the invocation of the member function will be determined automatically. The NatiDroid will first trace the statement where an object is assigned to the strong pointer. If it is assigned through method getService(), the type of the object will be determined by the passed service identifier; otherwise, NatiDroid will determine the type of strong pointer according to the variable type returned by the function.

Handling member variables. A class member variable is also possibly assigned by a strong pointer. For example, in line 7 of Figure 8, the setAudioSource() method of the mMediaRecorder object is invoked, but from this line of code we cannot determine the type of variable mMediaRecorder so that we cannot determine which setAudioSource() method has been invoked. By looking up in the referenced header file (from line 23), we find that this variable is a member variable of class MediaRecorder. Therefore, to find out the type of this variable, we can search the entire class looking for the assignment or initialization statement. Note that the assignment should be ignored if the assignment releases the pointer, e.g., pointing the variable to a null pointer. In this case, the variable mMediaRecorder is initialized as the return value of createMediaRecorder in the class constructor (line 12). We have explained how to determine the return type of createMediaRecorder in the previous paragraph, handling Android strong pointers. To implement the process automatically, if a variable cannot be tracked in the local scope, NatiDroid will point to the header file to check whether it is a member variable, and thus the tracking scope will be expanded to the entire class.

Handling particular access control. Binder transaction executes with the identity (i.e., PID and UID) of a caller. If a transaction needs to run with callee’s identity, Binder.clearCallingIdentity() can temporarily clear the caller’s identity, so that the callee’s identity will be checked during the transaction. Then the caller’s identity can be restored by Binder.restoreCallingIdentity(). To handle such intervened access control, NatiDroid uses callee’s identity for security control for all methods between the two Binder methods.

3.3.2 Protection Mapping Extraction. We present the workflow of extracting the protection mapping of Android APIs. After obtaining the cross-language CFG, we resort to a Depth-First Search (DFS) strategy to check if there are call paths between Android APIs in the Java API Framework layer and security checks in the Native Library layer. For each node in the CFG, if it is a Java node, we will collect all its native children and obtain the security checks defined in its children nodes. If there are more than one checkpoint on the call trace (e.g., an Android API is protected by multiple permissions),

we concatenate them with AND operation. As inspired by Aafer et al. [9], we also include security checks other than permission enforcement, such as UID and PID checks. If there is more than one Android API along the track, we create a mapping entry for each of them (i.e., all Android APIs along the track have the same security check). Finally, each pair of Java API node and its corresponding security check(s) in Native-side will be added into the protection mapping.

4 EVALUATION

The main contribution of this work, NatiDroid, is to enable the cross-language static analysis of the Android framework. Since there is a number of existing tools [4, 37] and works [9, 13] well handling the static analysis of the Java-side of Android framework, NatiDroid specifically focuses on the analysis of the crosslanguage part of the Android framework. In this section, we run experiments to verify the performance of NatiDroid and to answer

the following research questions:

• RQ1: How effective is NatiDroid in identifying permission enforcement in Native libraries?

• RQ2: How many permission mappings can our tool identify that are not recognized by the state-of-the-practice approaches?

• RQ3: How can our tool contribute to real-world security evaluation?

We use NatiDroid to extract the API protection mapping for 5 AOSP versions – 7.0, 7.1, 8.0, 8.1, and 10.0. We obtained the source code from the official AOSP repository [7]. The experiment ran on a Linux server with Intel (R) Core (TM) i9-9920X CPU @ 3.50GHz and 128 GB RAM.

4.1 RQ1: The Performance of NatiDroid

We present the statics of Permission-API protection mapping extracted from the Native libraries in Table 1 (columns 1 to 6). In each of AOSP 7.0, 7.1, 8.0, and 8.1, our tool identified more than 440 APIs whose permission enforcement take place in the Native code, while in AOSP 10.0, NatiDroid detected 282 APIs containing permission enforcement on the Native side. In Table 2, we list the permissions that have permission checks in native libraries, including 2 dangerous permissions, 8 signature permissions, and 2 normal permissions. Signature permissions are only granted if the requesting app is signed with the same certificate as the app that declared the permission. Normal permissions refer to the permissions with minimal risk to the system and the users’ private data. Dangerous permissions are those higher-risk permissions that would give access to private user data or control over the device that can negatively impact the users.

Due to the lack of ground truth for the permission protection mappings, it is difficult to evaluate the overall accuracy of the extracted mappings. We therefore resort to a manual process to examine the correctness of our mappings. To this end, we manually inspect the involved source code in the AOSP codebase to confirm if the APIs will go through the corresponding security check(s) and if the condition(s) in the security check(s) is(are) consistent with the condition(s) in the mappings. we randomly selected 20% of the total mappings (i.e., 421 mappings) for manual inspection and found no missing or redundant protection conditions in our mappings.

We further looked into the results on the affected API in each AOSP. In AOSP 10.0, the number of affected APIs decreased significantly, compared to other versions (282 in AOSP 10.0 vs. 456*.5 ± 7.*5 in AOSP 7.1 to 8.1). Through the manual code review, we found that two signature permissions, ACCESS_DRM_CERTIFICATES and CONTROL_WIFI_DISPLAY, have been removed in AOSP 10.0. At the same time, the code related to the audio service was refactored and simplified, which greatly reduced the number of exposed methods.

Answer to RQ1. NatiDroid identified more than 440 APIs that are protected by permission enforcement in the Native code in AOSP 7.1-8.1, and 282 APIs in AOSP 10.0, which are verified by manual inspection (20% sampling).

4.2 RQ2: Comparison with other mappings

State-of-the-art works use static and dynamic analysis approaches to extract the API-permission protection mapping from the Android framework. For instance, Axplorer [13] and Arcade [9] use static methods to retrieve the protection mapping. Dynamo [14], on the other hand, extracts the mapping by leveraging dynamic analysis.

This RQ compares our results with the mappings derived from these approaches. Unfortunately, the mapping from Dynamo lacks method signature information (e.g., class name, return type, and parameter list), which is required to identify a method (e.g., there could be methods with identical names in different classes). We cannot accurately compare our mapping with Dynamo’s mapping.

Therefore, in this RQ, we compare our work with Axplorer and Arcade. The mappings of Axplorer and Arcade are derived from their public repositories3 Note that since both works only released their mapping results (for AOSP version 7.1 and under) in lieu of the tools, we are unable to obtain the mappings on AOSP 8.0, 8.1, and 10.0. Table 1 presents the comparison of permission-API protection mappings on the five AOSP versions. Unsurprisingly, sinceNatiDroid only focuses on the permission checks in the Android Native libraries, while Axplorer and Arcade only work on the Android Java framework, there is no overlap between NatiDroid and the two works, respectively. As shown in the first two columns of Table 1, the number of newly identified permissions that are missed in previous works ranges from 9 to 12 in the five AOSP versions. There are 282 to 464 Android APIs associated with these permissions (i.e., invoking these APIs requires the corresponding permissions to be granted), counting up to approximate 30% of the mappings reported in the previous study, which are overlooked in the previous work that only analyzed the Java-side of the Android framework. Table 1 reports the breakdown of the mappings based on the permission protection levels. The mapping results contain the permissions in signature, dangerous, and normal levels. Missing these mappings, especially the ones for the dangerous permissions, will lead to false results in detecting security issues of Android apps, such as permission over-privilege (detailed in Section 4.3). Thus, the main security and privacy threats to the majority of Android apps (i.e., thirdparty apps, which usually have no access to signature permissions) are caused by inaccurate mapping of dangerous permissions. It is therefore worth highlighting that NatiDroid has identified the mappings for two additional dangerous permissions CAMERA and RECORD_AUDIO, which are closely related to user’s privacy.

Answer to RQ2. NatiDroid is able to identify 2 dangerous permission, 8 signature permissions, and 2 normal permissions that previous works have missed. Approximate 30% of the mappings reported in the previous study are related to these permissions.

4.3 RQ3: Applications of Protection Mappings

The protection mappings can be leveraged to detect security issues of Android apps, such as permission over-privilege. In this subsection, we evaluate the effectiveness of our extracted mappings in identifying the security vulnerabilities.

We include two categories of Android apps in our experiments: the custom ROM apps that are pre-installed on the devices, such as Camera and Calendar, as well as the third-party apps that users can download from official or alternative app stores (e.g., social networking apps, banking apps). The experimental dataset contains 1,035 custom ROM apps extracted from five Android custom ROMs of four vendors (i.e., Samsung, LG, Huawei, and Xiaomi) and 10,000 third-party apps randomly downloaded from the Google Play store.

Table 3 shows an overview of the dataset in use. We use the permission protection mappings extracted from AOSP to detect security vulnerabilities in both the custom ROM apps and third-party apps. The AOSP mapping may miss some vendorcustomized permissions (e.g., huawei.permission.SET_SMSC_ADDRESS), which may be used in the custom ROM apps. Nevertheless, we argue that using the AOSP mapping may miss some vulnerabilities caused by the misuse of vendor-customized permissions but will not affect the results corresponding to the official permissions, serving as the main scope of our study.

Permission Over-privilege Detection. Android app developers access Android framework functionalities by invoking Android APIs. Some APIs have access to sensitive information, such as reading the contact list, are protected by permissions. Developers need to request such permissions from the Android system before accessing the sensitive resources. Specifically, a list of required permissions need to be declared in the AndroidManifest.xml file, and the corresponding permissions protected APIs are to be invoked in the app’s implementation.

![ Figure 9: False-positive over-privileged permissions in previous works detected by NatiDroid.

](https://images.mirror-media.xyz/publication-images/vMt5rABrFSO4OWD0EQy_m.png?height=255&width=470)

According to the Android developers’ documentation [5], app developers should request a minimum set of permissions required to complete the app’s functionality, as introducing additional permissions will increase the risk of privacy leak. However, developers usually (either intentionally or unintentionally) request permissions that are not related to the functionalities actually implemented in the app, and hence, not necessary [17]. To detect apps with the permission over-privilege issue, we extract the reachable APIs of the app (with a 30-min timeout), and retrieve its protection conditions (e.g., permission, UID, PID) according to the mappings. For instance, when invoking an Android API, it may check the UID (e.g., uid == AID_SYSTEM checks if the app has system privilege) and the PID (e.g., callingPid == getpid() examines if the method is called by its own process) along with permission enforcement. While the UID can be retrieved statically, the PID has to be determined at run-time, thereby cannot be obtained through static analysis. Nevertheless, the apps included in the experiment are custom ROM apps and third-party apps, which cannot possess PID of Android system services. Therefore, it is safe to assume that callingPid == getpid() will always return false in our tested apps. Finally, if the app declares permission (in the AndroidManifest.xml file) that is not required (i.e., no APIs associated with the permissions found in the app), we flag it as an over-privilege case. Results. Table 4 presents the over-privilege detection results.

To demonstrate the effectiveness of our mappings in pinpointing the permission over-privilege issue, we compare previous works’ results with our results. Recall that NatiDroid only extracts protection mappings from the Native libraries, in our results (i.e., Nati + Ar and Nati + Ax), the Java-side mappings are derived from Arcade and Axplorer. We identify 95.8% and 95.5% apps with a permission over-privilege issue using Arcade’s and Axplorer’s mappings, respectively. Among their results, we identify that 66.6% and 54.2% apps (in Arcade’s and Axplorer’s results, respectively) contain false-positive results caused by missing Native-triggered permission mappings. Specifically, as shown in the last four columns of Table 4, there are 8,063 and 357 permissions that are erroneously identified as over-privilege by Axplorer in 71.5% third-party apps and 22.2% custom ROM apps, respectively; for Axplorer, 7,894 permissions in 71.38% third-party apps and 309 permissions in 21.02% custom ROM apps are found to be false-positive. Interestingly, both Arcade and Axplorer report that a significantly high proportion of apps (approximately 96%) suffer from a permission over-privilege issue. We therefore take an in-depth look into their detection results and observe that the majority of their false positives are caused by missing native triggered INTERNET permission mappings. As illustrated in Figure 9, we further present the breakdown of permissions that cause the false positive results in the comparing methods.

Specifically, missing INTERNET permission mappings leads to 6,661 and 6,660 false positives in Arcade’s and Axplorer’s results. Other missing permission mappings that contribute to the false positives include RECORD_AUDIO (623 false positives in both Arcade and Axplorer), MODIFY_AUDIO_- SETTINGS (424 and 256 false positives in Arcade and Axplorer, respectively), and CAMERA (355 false positives in both Arcade and Axplorer).

Manual inspection. Due to the lack of ground truth, we manually inspect if the over-privileged permissions detected are indeed unneeded by the containing apps. The first two authors of this paper and three security researchers are involved in the manual inspection. The result is determined via majority voting. For each app, we decompile the APK file and locate the relevant APIs. Then, we manually check the app’s context and determine whether the invocation of the APIs meet the conditions in the protection mappings. As this process involves immense manual efforts, it cannot scale to cover a large number of apps. Hence, we manually verified 100 randomly selected apps. Our manual analysis indicates that most of the cases are true positives (95%). The remaining five apps contain implicit parameters passed to the APIs to be examined,which cannot be precisely inferred via static analysis. Nevertheless, we resort to a dynamic approach to verify the remaining five apps. Specifically, we remove the permissions in question from the AndroidManifest.xml file and repackage the app. Then, we manually test the app on an emulator to confirm if the app crashes or the corresponding functions are disabled. As a result, the removal of the permissions in question has no impact on the apps, suggesting that these permissions are indeed unneeded.

Answer to RQ3. Our tool identified that more than half of the over-privilege results from state-of-the-practice tools are false-positives, which is caused by missing Native-triggered permission mappings.

5 THREATS TO VALIDITY Android versions. In this paper, we propose a solution to facilitate cross-language static analysis of the Android framework andbuild a prototype system, NatiDroid, to extract API-permission mapping from the Android operating system. To compare with the state-of-the-practice works Axplorer and Arcade, which are close sourced and only generated the mappings up to Android 7.1, in our experiment, we extract the mappings from the latest versions they have (i.e., 7.0 and 7.1) and three newer versions (i.e., 8.0, 8.1 and 10.0). Nevertheless, the proposed solution can apply to any Android version, with further engineering works to be done in the pre-processing module. Custom ROMs. Android smartphones such as Samsung, Huawei and Xiaomi, are shipped with vendor-customized Android systems (i.e., custom ROMs) rather than the AOSP. Unfortunately, these custom ROMs are not open-source. The proposed solution takes the source code as input; therefore, it cannot extract permission mappings from these close sourced custom ROMs. However, smartphone vendors can use our solution to analyze their customized Android versions based on their source code. Nevertheless, to maintain the compatibility of running third-party apps, such custom ROMs are not likely to modify the normal and dangerous level permission specifications of AOSP that third-party apps can access, but rather add a few signature level permissions for their own system apps. Therefore, the results derived from AOSP will not affect the security analysis of third-party apps, which are the majority in the Android ecosystem. On the other hand, with NatiDroid, thirdparty vendors can perform an inner security analysis on custom ROM source code, determine whether there are errors in the implementation of permission mappings, and further detect permission over-privilege before releasing an app. Static analysis. When detecting over-privilege issues in Android apps, we may suffer from the intrinsic vulnerability of static code analysis when encountering code obfuscation and reflection, which may lead to the unsoundness of our results. When building the apps’ call graph, our method may yield unsound results because it may miss the context and the parameters that can only be obtained at run-time. For example, the API android.media.MediaPlayer: void setDataSource requires the Internet permission when the data source is online media. The source is not always a static string so that it may be assigned at run-time. Nevertheless, these challenges are regarded as well known and non-trivial issues to overcome in the research community [34].

6 RELATED WORK

Chaoran citation, surname not first name Android API protection mapping. Stowaway [17] initially explored and analyzed the Android permission specification.

They extracted API mappings based on the feedback directed fuzz testing, and dynamically recorded the permission checks of APIs. The mappings they extracted are accurate, but the code coverage is limited. To address the limitations of low code coverage, PScout [12] uses static analysis to extract the API protection mapping. However, they did not consider the context of the API invocation, and thus may produce false positive mappings. Axplorer [13] leverages more accurate static analysis on the SDK and Android framework, and generates more precise permission specifications. Arcade [9] conducted a similar static analysis, with additional attention paid to extract other security mechanisms, such as UID and PID checks. While these works only analyzed the Java-side of the Android, none of the works has looked into the native libraries within the Android framework. In order to overcome the limitations of static analysis, Dynamo [14] uses dynamic analysis, aiming to obtain more accurate mapping. PSGen [39] conducts an analysis permission specification for Android NDK (i.e., both protected APIs and permission checks are in the Native code), which has a different scope to our approach (i.e., protected APIs are in the Java framework, while the permission checks are in the Native code).

Our work fills the research gap by analyzing the native libraries and their communications with the Java framework to produce more comprehensive permission protection mappings. Cross-language analysis on Android. Chaoran NCScope: hardwareassisted analyzer for native code in Android apps A plethora of works have proposed to solve the analysis of cross-language code. George et al. [20] scanned the binary libraries and cross-referenced the information to search the call-backs from Native code to Java. Their work focuses on the JNI mechanism alone. However, the Android framework provides other IPC mechanisms, such as AIDL, which are not considered. Fengguo et al. [38] proposed a static analysis framework that focuses on performing cross-language modeling and generating call graphs for the analyzed apps. Jucify [36] conducts a static analysis approach which extends the app analysis to native codes. Nevertheless, these works are only applicable to Android apps, which are far less complicated than the Android framework we analyzed. In addition, our cross-language analysis handles various Android IPC mechanisms such as JNI and AIDL.

7 CONCLUSION

We proposed a novel approach, NatiDroid, to facilitate the crosslanguage analysis of the Android framework. NatiDroid identifies the entry-point pairs of both Java- and Native-sides of the Android framework, where both sides are communicated through JNI and AIDL based mechanisms, so NatiDroid builds the cross-language CFG on the overall Android framework (Java + Native code). Based on the cross-language CFG, we extracted Native-triggered permission specifications and created the protection mappings in the native code to complement existing Java-based mappings. We further applied our new mappings to detect permission over-privilege vulnerabilities in a large dataset consisting of more than 11,000 Android apps. We finally show that using the mapping derived by NatiDroid can identify a significant number of false results existing in the state of the art, such as Axplorer and Arcade.

REFERENCES

[1] 1999. Soot - Java Analysis Framework. http://sable.github.io/soot/.

[2] 2000. Clang: A C language family frontend for LLVM. https://clang.llvm.org.

[3] 2000. Introduction to the Clang AST. https://clang.llvm.org/docs/

IntroductionToTheClangAST.html.

[4] 2006. WALA: T.J. Watson Libraries for Analysis. https://github.com/wala/WALA.

[5] 2008. Developer Guides | Android Developers. https://developer.android.com/

guide.

[6] 2012. Soong Build System. https://source.android.com/setup/build.

[7] 2021. Android Open Source Project. https://source.android.com.

[8] 2021. Google Play - Camera360 Photo Editor + Camera & Beauty Selfies. https:

//play.google.com/store/apps/details?id=vStudio.Android.Camera360.

[9] Yousra Aafer, Guanhong Tao, Jianjun Huang, Xiangyu Zhang, and Ninghui Li.

2018. Precise Android API protection mapping derivation and reasoning. In

Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications

Security. 1151–1164.

[10] Yousra Aafer, Nan Zhang, Zhongwen Zhang, Xiao Zhang, Kai Chen, XiaoFeng

Wang, Xiaoyong Zhou, Wenliang Du, and Michael Grace. 2015. Hare hunting in

the wild Android: A study on the threat of hanging attribute references. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications

Security. 1248–1259.

[11] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. 2014.

Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint

analysis for Android apps. ACM SIGPLAN Notices 49, 6 (2014), 259–269.

[12] Kathy Wain Yee Au, Yi Fan Zhou, Zhen Huang, and David Lie. 2012. Pscout:

analyzing the Android permission specification. In Proceedings of the 2012 ACM

conference on Computer and Communications Security. 217–228.

[13] Michael Backes, Sven Bugiel, Erik Derr, Patrick McDaniel, Damien Octeau, and

Sebastian Weisgerber. 2016. On Demystifying the Android Application Framework: Re-Visiting Android Permission Specification Analysis. In 25th USENIX

Security Symposium (USENIX Security 16). USENIX Association, Austin, TX,

1101–1118. https://www.usenix.org/conference/usenixsecurity16/technicalsessions/presentation/backes_android

[14] Abdallah Dawoud and Sven Bugiel. 2021. Bringing balance to the force: Dynamic

analysis of the android application framework. In NDSS.

[15] Manuel Egele, David Brumley, Yanick Fratantonio, and Christopher Kruegel.

2013. An empirical study of cryptographic misuse in android applications. In

Proceedings of the 2013 ACM SIGSAC conference on Computer & communications

security. 73–84.

[16] Sascha Fahl, Marian Harbach, Marten Oltrogge, Thomas Muders, and Matthew

Smith. 2013. Hey, you, get off of my clipboard. In International Conference on

Financial Cryptography and Data Security. Springer, 144–161.

[17] Adrienne Porter Felt, Erika Chin, Steve Hanna, Dawn Song, and David Wagner.

2011. Android permissions demystified. In Proceedings of the 18th ACM conference

on Computer and Communications Security. 627–638.

[18] Yu Feng, Saswat Anand, Isil Dillig, and Alex Aiken. 2014. Apposcopy: Semanticsbased detection of Android malware through static analysis. In Proceedings of

the 22nd ACM SIGSOFT International Symposium on Foundations of Software

Engineering. 576–587.

[19] Earlence Fernandes, Jaeyeon Jung, and Atul Prakash. 2016. Security analysis

of emerging smart home applications. In 2016 IEEE symposium on security and

privacy (SP). IEEE, 636–654.

[20] George Fourtounis, Leonidas Triantafyllou, and Yannis Smaragdakis. 2020. Identifying Java calls in native code via binary scanning. In Proceedings of the 29th ACM

SIGSOFT International Symposium on Software Testing and Analysis. 388–400.

[21] Clint Gibler, Jonathan Crussell, Jeremy Erickson, and Hao Chen. 2012. AndroidLeaks: automatically detecting potential privacy leaks in Android applications

on a large scale. In International Conference on Trust and Trustworthy Computing.

Springer, 291–307.

[22] Michael I Gordon, Deokhwan Kim, Jeff H Perkins, Limei Gilham, Nguyen Nguyen,

and Martin C Rinard. 2015. Information flow analysis of Android applications in

DroidSafe. In NDSS, Vol. 15. 110.

[23] Michael C Grace, Yajin Zhou, Zhi Wang, and Xuxian Jiang. 2012. Systematic

detection of capability leaks in stock Android smartphones. In NDSS, Vol. 14. 19.

[24] Jianjun Huang, Xiangyu Zhang, and Lin Tan. 2016. Detecting sensitive data disclosure via bi-directional text correlation analysis. In Proceedings of the 2016 24th

ACM SIGSOFT International Symposium on Foundations of Software Engineering.

169–180.

[25] Soo Hyeon Kim, Daewan Han, and Dong Hoon Lee. 2013. Predictability of

android OpenSSL’s pseudo random number generator. In Proceedings of the 2013

ACM SIGSAC conference on Computer & Communications Security. 659–668.

[26] William Klieber, Lori Flynn, Amar Bhosale, Limin Jia, and Lujo Bauer. 2014.

Android taint flow analysis for app sets. In Proceedings of the 3rd ACM SIGPLAN

International Workshop on the State of the Art in Java Program Analysis. 1–6.

[27] Li Li, Alexandre Bartel, Jacques Klein, Yves Le Traon, Steven Arzt, Siegfried

Rasthofer, Eric Bodden, Damien Octeau, and Patrick Mcdaniel. 2014. I know

what leaked in your pocket: Uncovering privacy leaks on Android Apps with

Static Taint Analysis. arXiv preprint arXiv:1404.7431 (2014).

[28] Tongxin Li, Xiaoyong Zhou, Luyi Xing, Yeonjoon Lee, Muhammad Naveed, XiaoFeng Wang, and Xinhui Han. 2014. Mayhem in the push clouds: Understanding

and mitigating security hazards in mobile push-messaging services. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications

Security. 978–989.

[29] Baozheng Liu, Chao Zhang, Guang Gong, Yishun Zeng, Haifeng Ruan, and

Jianwei Zhuge. 2020. FANS: Fuzzing Android Native System Services via Automated Interface Analysis. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 307–323. https://www.usenix.org/conference/

usenixsecurity20/presentation/liu

[30] Kangjie Lu, Zhichun Li, Vasileios P Kemerlis, Zhenyu Wu, Long Lu, Cong Zheng,

Zhiyun Qian, Wenke Lee, and Guofei Jiang. 2015. Checking more and alerting

less: Detecting privacy leakages via enhanced data-flow analysis and peer voting.

In NDSS.

[31] Mohammad Nauman, Sohail Khan, and Xinwen Zhang. 2010. Apex: Extending

Android permission model and enforcement with user-defined runtime constraints. In Proceedings of the 5th ACM symposium on Information, Computer and

Communications Security. 328–332.

[32] Damien Octeau, Patrick McDaniel, Somesh Jha, Alexandre Bartel, Eric Bodden, Jacques Klein, and Yves Le Traon. 2013. Effective Inter-Component Communication Mapping in Android: An Essential Step Towards Holistic Security

Analysis. In 22nd USENIX Security Symposium (USENIX Security 13). USENIX

Association, Washington, D.C., 543–558. https://www.usenix.org/conference/

usenixsecurity13/technical-sessions/presentation/octeau

[33] Rahul Pandita, Xusheng Xiao, Wei Yang, William Enck, and Tao Xie. 2013. WHYPER: Towards Automating Risk Assessment of Mobile Applications. In 22nd

USENIX Security Symposium (USENIX Security 13). USENIX Association, Washington, D.C., 527–542. https://www.usenix.org/conference/usenixsecurity13/

technical-sessions/presentation/pandita

[34] Felix Pauck, Eric Bodden, and Heike Wehrheim. 2018. Do android taint analysis

tools keep their promises?. In Proceedings of the 2018 26th ACM Joint Meeting on

European Software Engineering Conference and Symposium on the Foundations of

Software Engineering. 331–341.

[35] Zhengyang Qu, Vaibhav Rastogi, Xinyi Zhang, Yan Chen, Tiantian Zhu, and

Zhong Chen. 2014. Autocog: Measuring the description-to-permission fidelity

in Android applications. In Proceedings of the 2014 ACM SIGSAC Conference on

Computer and Communications Security. 1354–1365.

[36] Jordan Samhi, Jun Gao, Nadia Daoudi, Pierre Graux, Henri Hoyez, Xiaoyu Sun,

Kevin Allix, and Bissyandé. 2022. JuCify: A Step Towards Android Code Unification for Enhanced Static Analysis. In 2022 IEEE/ACM 44th International Conference

on Software Engineering (ICSE).

[37] Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and

Vijay Sundaresan. 2010. Soot: A Java bytecode optimization framework. In

CASCON First Decade High Impact Papers. 214–224.

[38] Fengguo Wei, Sankardas Roy, and Xinming Ou. 2018. Amandroid: A precise and

general inter-component data flow analysis framework for security vetting of

Android apps. ACM Transactions on Privacy and Security (TOPS) 21, 3 (2018),

1–32.

[39] Hao Zhou, Haoyu Wang, Shuohan Wu, Xiapu Luo, Yajin Zhou, Ting Chen, and

Ting Wang. 2021. Finding the Missing Piece: Permission Specification Analysis

for Android NDK. In 2021 36th IEEE/ACM International Conference on Automated

Software Engineering (ASE). IEEE, 505–516.

Fuzzing: A Survey for Roadmap

hyperlab@newsletter.paragraph.com (HyperLab) — Fri, 22 Jul 2022 20:52:23 GMT

XIAOGANG ZHU, Swinburne University of Technology, Australia

SHENG WEN∗, Swinburne University of Technology, Australia

SEYIT CAMTEPE, CSIRO Data61, Australia

YANG XIANG, Swinburne University of Technology, Australia

Fuzz testing (fuzzing) has witnessed its prosperity in detecting security flaws recently. It generates a large number of test cases and monitors the executions for defects. Fuzzing has detected thousands of bugs and vulnerabilities in various applications. Although effective, there lacks systematic analysis of gaps faced by fuzzing. As a technique of defect detection, fuzzing is required to narrow down the gaps between the entire input space and the defect space. Without limitation on the generated inputs, the input space is infinite. However, defects are sparse in an application, which indicates that the defect space is much smaller than the entire input space. Besides, because fuzzing generates numerous test cases to repeatedly examine targets, it requires fuzzing to perform in an automatic manner. Due to the complexity of applications and defects, it is challenging to automatize the execution of diverse applications. In this paper, we systematically review and analyze the gaps as well as their solutions, considering both breadth and depth. This survey can be a roadmap for both beginners and advanced developers to better understand fuzzing.

CCS Concepts: • Security and privacy → Software security engineering; • Fuzz Testing;

Additional Key Words and Phrases: Fuzz Testing, Security, Fuzzing Theory, Input Space, Automation

ACM Reference Format:

Xiaogang Zhu, Sheng Wen, Seyit Camtepe, and Yang Xiang. 2022. Fuzzing: A Survey for Roadmap. ACM Comput. Surv. 1, 1, Article 1 (January 2022), 35 pages. https://doi.org/10.1145/3512345

1 INTRODUCTION

Software security is a severe problem of computer systems, which affects people’s daily life and even causes severe financial problems [109, 169]. Fuzz testing has become one of the most successful techniques to detect security flaws in programs. Fuzzing generates numerous test cases to test target programs repeatedly and monitors the exceptions of the program. The exceptions are the indicators of potential security flaws. Generally, fuzzing has a queue of seeds, which are the interesting inputs, and new inputs are generated via mutating seeds in an infinite loop. By steering computing resources for fuzzing in different ways [22, 23, 37, 40, 65, 188, 202], researchers can discover vulnerabilities more effectively and eficiently. Until now, fuzzing has discovered thousands of bugs in real-world programs, such as bugs discovered in general applications [30], IoT (Internet of Things) devices [36], firmware [211], kernels [194], and database systems [212].

∗Corresponding author.

————————————————————————————————————————

Authors’ addresses: Xiaogang Zhu, xiaogangzhu@swin.edu.au, Swinburne University of Technology, Melbourne, VIC, Australia; Sheng Wen, swen@swin.edu.au, Swinburne University of Technology, Melbourne, VIC, Australia; Seyit Camtepe, Seyit.Camtepe@data61.csiro.au, CSIRO Data61, Sydney, NSW, Australia; Yang Xiang, yxiang@swin.edu.au, Swinburne University of Technology, Melbourne, VIC, Australia.

————————————————————————————————————————

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

—————————————————————————————————————————

Fig. 1. Illustration of knowledge gaps in the domain of fuzzing. Fuzzing theories are utilized to improve the efficiency of defect detection (§3). The reduction of input space restricts the generated inputs to valid input space (§4). The automatic execution is the basis to apply fuzzing on various applications (§5).

Although fuzzing has achieved great success in detecting security flaws, it still has knowledge gaps for developing eficient defect detection solutions. As depicted in Fig.1, the three main gaps are the sparse defect space of inputs, the strict valid input space, and the automatic execution of various targets. The following paragraphs explain the details of the gaps.

Gap 1: sparse defect space of inputs. Defects are sparse in applications, and only some specific inputs can trigger the defects. Because the essential purpose of fuzzing is to detect defects in target applications, fuzzing theories are required to generate inputs that belong to the defect space. Some security flaws are shallow so that they can be discovered in a short time of fuzzing campaigns. However, many security flaws require fuzzing to examine complex execution paths and solve tight path constraints. Therefore, a high eficient fuzzing algorithm requires a sophisticated understanding of both programs under test (PUTs) and security flaws. Because defects are usually unknown before fuzzing campaigns, fuzzing theories based on the understanding of PUTs and/or security flaws steer computing resources towards code regions that are more likely to have defects.

Gap 2: strict valid input space. Fuzzing has been utilized in various applications, and each application requires its own specific inputs. Modern applications are increasingly larger, resulting in more complex specifications of inputs. Therefore, it is challenging to generate valid inputs that target applications accept. Moreover, to improve the eficiency of fuzzing, it is better that the generated inputs exercise different execution states (e.g., code coverage). This requires fuzzing to develop more advanced schemes for the generation of valid inputs. Without a systematic analysis of PUTs, it is almost impossible to restrict the input space precisely. For instance, a random mutation of PDF files may breach the specifications of PDF. Fuzzing needs to mutate PDF files carefully so that the generated inputs belong to the valid input space.

Gap 3: various targets. Because fuzzing repeatedly tests PUTs for numerous trials, fuzzing processes are required to perform automatically for high eficiency. Due to the diversity of PUTs and defects, the execution environments vary. It is straightforward for some applications to be fuzzed in an automatic manner, such as command-line programs. However, many other applications require extra efforts to be tested automatically, such as hardware. Moreover, the security flaws also require automatic indicators to record potential real flaws. Program execution crash is a widely-used indicator as it automatically catches exceptions from operating systems. However, many other security flaws may not manifest themselves as crashes, such as data races. These flaws require careful designs so that they are automatically recorded during fuzzing campaigns.

The research community has put in many efforts to narrow down these gaps. Many theories have been proposed to formulate fuzzing processes partially. Besides, various approaches have been designed to reduce the input space. Moreover, different execution scenarios have been successfully automatized. A few surveys have shed light on fuzzing, but none of them systematically review and analyze the gaps of fuzzing as well as their solutions [102, 107, 119]. Therefore, it is still unclear about questions such as what the gaps are, what the potential solutions are, and how to bridge the gaps. While other surveys focus on what fuzzing is, this paper also explains how and why the existing solutions solve problems. With such information, this paper builds a roadmap to bridge the gaps and pave roads for future research. For beginners who have limited knowledge of fuzzing, this paper provides them with the conception of fuzzing and the existing solutions to improve fuzzing. For the advanced developers, this paper also provides them with three main roads (i.e., three gaps) so that they can make a breakthrough by following one or more roads.

In this paper, we systematically review and analyze the gaps and solutions of fuzzing, considering both breadth and depth. Since fuzzing is primarily a technique for security issues, we mainly collect papers from security and software conferences, including but not limited to four cyber security conferences and three software engineering conferences, from January 2008 to May 2021. The four cyber security conferences are the ACM Conference on Computer and Communications Security (CCS), the Network and Distributed System Security Symposium (NDSS), the IEEE Symposium on Security and Privacy (S&P), and Usenix Security Symposium (USENIX). The three software engineering conferences are International Conference on Automated Software Engineering (ASE), International Conference on Software Engineering (ICSE), and ACM SIGSOFT Symposium on the Foundation of Software Engineering/European Software Engineering Conference (FSE/ESEC).

The paper is organized as follows. §2 introduces the overview of fuzzing. §3 depicts fuzzing processes and various fuzzing theories to formulate the processes. §4 analyzes diverse solutions to reduce the search space of inputs. §5 analyzes how to automatize the execution of various PUTs and the detection of different bugs. §6 offers some directions for future research.

2 OVERVIEW OF FUZZING

Terminologies. We first introduce some terminologies for the convenience of reading. As depicted in Fig.2, an input is retained as a seed when the input achieves better fitness (e.g., new coverage). The fitness measures the quality of a seed or input. Later fuzzing campaigns will select seeds in the seed set to perform mutations. The mutation operators are called mutators. When a seed is selected from the seed set, a power schedule determines the energy assigned to the seed. The energy is the number of mutations assigned for the current fuzzing round. The implementation of a fuzzing algorithm is called a fuzzer.

In 1990, B.P. Miller et al. tested 90 programs by running them on random input strings and found that more than 24% of them crashed [129]. The program that generated random input strings was named fuzz by Miller. Since then, fuzz testing (or fuzzing) has become the name of the technique that identifies bugs via numerous test cases. Nowadays, because programs are increasingly more complex and fuzzing is an inexpensive approach, fuzzing has been one of the major tools for bug discovery. As shown in Fig.2, fuzzing consists of three basic components, namely input generator, executor, and defect monitor. The input generator provides the executor with numerous inputs, and the executor runs target programs on the inputs. Then, fuzzing monitors the execution to check if it discovers new execution states or defects (e.g., crashes).

Fuzzing can be divided into generation-based and mutation-based fuzzing from the perspective of input generation. The generation-based fuzzing generates inputs from scratch based on grammars [3, 56, 57] or valid corpus [74, 79, 111, 112, 185]. As shown in Fig.2, generation-based fuzzing gets inputs directly from a seed set. On the other hand, the mutation-based fuzzing mutates existing inputs, which are called seeds, to get new inputs [22, 23, 66, 206, 216]. Given a seed set, mutation-based fuzzing performs seed schedule, byte schedule, and mutation schedule to obtain inputs. Note that, it is not necessary for fuzzing to go through all steps in Fig.2. For example, generation-based fuzzing does not conduct byte schedule or mutation schedule but focuses on selecting the optimal seed set from initial input files.

Based on the amount of information observed during execution, fuzzing can be classified into blackbox, greybox, and whitebox fuzzing. Blackbox fuzzing does not have any knowledge about the internal states of each execution [3, 29, 36, 78, 99]. These fuzzers usually optimize fuzzing processes via utilizing input formats [59, 78, 99] or different output states [55, 61]. Whitebox fuzzing obtains all the internal knowledge of each execution, enabling it to systematically explore the state space of target programs. Whitebox fuzzing usually utilizes concolic execution (i.e., dynamic symbolic execution) to analyze target programs [71, 81, 150, 177, 204]. Greybox fuzzing obtains the knowledge of execution states between blackbox and whitebox fuzzing, e.g., many fuzzers use edge coverage as the internal execution states [7, 23, 66, 133, 216].

The most common execution states are the information of code coverage (e.g., basic blocks [133] or edges [206] in control flow graphs (CFGs)). The basic assumption for the usage of coverage is that the discovery of more execution states (e.g., new coverage) increases the possibility of finding bugs. For example, Miller [130] reports that the increase of 1% in code coverage results in the increase of 0.92% in bug discovery. Therefore, coverage-guided fuzzing aims to explore more code coverage [23, 66, 101, 203, 206]. Yet, due to the diverse usage of fuzzing, the execution states are not limited to code coverage. The states can be legality of executions for object-oriented programs [141], state machine for protocol implementations [3, 11, 55, 61, 64, 69, 154], alias coverage for concurrency implementations [194], neuron coverage for deep learning models [148], or execution logs for Android SmartTVs [1].

Fuzzers often use crashes as the indicator of security bugs because crashes offer straightforward automatic records [206]. The operating systems will automatically incur signals to inform program crashes. However, some defects do not manifest themselves as crashes; thus, fuzzers utilize other defect indicators, such as physical safety violations [41, 82]. Note that, the indicators only show potential security issues, which are further verified by security tools or manual efforts to confirm a vulnerability [85, 168, 175].

3 FUZZING THEORY

The essential objective of fuzzing is to search for the test cases that can trigger defects (e.g., bugs). The challenge is that the search space is infinite due to the complexity of PUTs. Therefore, searching defect-triggering inputs with blind mutation is as hard as finding a needle in a haystack.

Table 1. Fuzzers and their optimization solutions. Based on different understanding of fuzzing processes,optimization solutions are utilized to formulate different processes. Different types of fitness are designed based on features of different applications or bugs.

MC: Markov Chain; MSCP: Minimal Set Cover Problem; ILP: Integer Linear Programming Problem; WCCP: Weighted Coupon Collector’s Problem; VAMAB: Variant of Adversarial Multi-Armed Bandit; UCB1: Upper Confidence Bound, version one; MH: Metropolis-Hastings; PSO: Particle Swarm Optimization; Shannon: Shannon’s entropy; Species∗:Models of Species Discovery; ACO: Ant Colony Optimization; SA: Simulated Annealing; NN: Neural Network; MTNN:Multi-task Neural Networks; GA: Genetic Algorithm; GD: Gradient Descent; MOO: Multi-objective Optimization; R:Random. set.: Seed Set Selection; seed.: Seed Schedule; byte.: Byte Schedule; mutation.: Mutation Schedule; rete.: Seed Retention; : Whitebox Fuzzing; : Greybox Fuzzing; : Blackbox Fuzzing.△: More sensitive code coverage.∗: The paper does not assign energy based on the model but in fact, the model can achieve energy assignment.

To improve the possibility of exposing defects, fuzzers use feedback from executions, such as execution states or results, as the fitness. A typical fitness is based on code coverage (e.g., basic blocks or edges) [23, 39, 116, 157, 203, 210], which is used to determine the reward for the generated inputs. However, code coverage is not always available, and even when it is available, it maynot be sensitive enough to expose defects (§3.5). Moreover, with only code coverage as feedback, generating exponentially more inputs per minute (i.e., using exponentially more machines) exposes linearly more vulnerabilities only [19]. Therefore, a common improvement is to optimize fuzzing processes or enrich the information for fitness.

Fuzzing theories intend to optimize fuzzing processes so that fuzzing can expose defects more effectively and eficiently, i.e., fuzzing can discover defects in a shorter time budget or with fewer test cases. The existing fuzzers use various solutions of optimization problems for fuzzing processes. As shown in Fig.2, fuzzing can optimize the seed set, seed schedule, byte schedule, and mutation schedule with the fitness based on execution states or defect discovery. Table 1 shows the optimization solutions for different fuzzing processes. The column Fitness By indicates how the corresponding solutions formulate fuzzing processes. The columns Gene.- and Muta.-based are generation-based and mutation-based fuzzing, respectively. The fuzzers in Table 1 are selected because their main contributions are the development of fuzzing theories.

3.1 Seed Set Selection

The optimization of seed sets focuses on minimizing the size of a set, i.e., it selects the fewest number of seeds that can cover all the discovered code coverage [2, 147, 206]. The reason for the minimization is that the redundancy of seeds wastes computing resources on examining the well-explored code regions. COVERSET [158] formulates the problem of minimizing a seed set as a minimal set cover problem (MSCP), which minimizes the subsets that contain all elements. Since MSCP is an NP-hard problem, COVERSET uses a greedy polynomial approximation algorithm to obtain the minimal set.

3.2 Seed Schedule

With the seed set, the seed schedule aims to solve the problems of 1) which seed to be selected for the next round, and 2) the time budget for the selected seed. In practice, instead of time budget, most fuzzers optimize the number of times of mutating the selected seeds [23, 203, 206]. Although these two problems are pretty clear, the existing solutions for optimizing them vary due to the complexity of PUTs or bugs. The most critical challenge is the agnostic of undiscovered code coverage or bugs. Before verifying bugs, one cannot know whether an input can trigger a bug; thus, the optimization problem does not have an effective fitness. Similarly, before examining code lines, one cannot obtain the probability distribution of program behaviors (e.g., branch behaviors); thus, it is (almost) impossible to find the global optimal solution mathematically. As a result, researchers approximately or partially formulate fuzzing processes based on diverse optimization problems. As shown in Fig.2, fuzzing theories can be designed based on the properties of defects (e.g., bug arrival) or execution states (e.g., edge transition).

3.2.1 Fitness by #Bugs. Generally, fuzzing utilizes two types of fitness for optimization problems, namely fitness based on bugs or execution states (e.g., code coverage). The fitness is a scalar that measures the quality of a seed or input. Since fuzzing is a technique for bug discovery, an intuitive fitness is the number of bugs. In order to maximize the number of bugs, one approach is to schedule the time budget of each seed while randomly or sequentially selecting seeds. Without considering the execution states, the maximization problem can be simplified as an Integer Linear Programming (ILP) problem [158]. That is, ILP intends to maximize the number of bugs with the linear constraints, such as the upper bound of the time budget for each seed. By solving the ILP problem, one can automatically compute the time budget for each seed. Another insight is to regard the bug arrival process as the weighted Coupon Collector’s Problem (WCCP) [13]. The WCCP estimates how much time is required before a new bug arrives in the process, which is similar to the problem aboutthe expected purchases required before a consumer obtains a new coupon [192]. Each unique bug obtained by fuzzing is a type of coupon, and WCCP aims to predict the number of purchases (time budget) required for the arrival of new coupons (unique bugs). In order to predict the bug arrival time, WCCP requires the distribution of all bugs, which is estimated by observing the discovered bugs. Based on the distribution, fuzzing will gradually assign minimum time, which is predicted by WCCP, for seeds to expose new bugs. Both ILP and WCCP intend to assign more time budget to seeds that have more potential to expose bugs.

3.2.2 Fitness by State Transition (Markov Chain). Because bugs are sparse in PUTs, the optimization process will quickly converge to local optima when using the number of bugs as fitness. Thus, fuzzing will focus on the code regions that are related to discovered bugs. As a result, it may miss the opportunities to explore more code coverage. In this case, deep bugs, which are guarded by complex conditions, can evade fuzzing. To mitigate this problem, fuzzers calculate fitness based on execution states (e.g., code coverage) because execution states can provide fuzzing with more information. Most existing fuzzers compute fitness based on code coverage, i.e., they intend to discover more code coverage [21, 23, 62, 88, 157, 203]. Another reason for using code coverage is that larger code coverage indicates a higher possibility of bug discovery [130].

As depicted in Fig.2, if fuzzing succeeds in formulating state transitions, it can efficiently guide fuzzing to explore undiscovered states. A popular theory for modeling state transition is the Markov chain [93], which is a stochastic process that transitions from one state to another. In a nutshell, the Markov chain maintains a probability table, in which an element 𝑝𝑖𝑗 is the transition probability that the chain transitions from a state 𝑖 to another state 𝑗. In order to formulate fuzzing, one solution is to define a basic block in CFGs as a state, and the state transition is the jump from one basic block to another. Because the transition probabilities of blocks cannot be obtained without running the program, fuzzers calculate the probabilities by recording the frequencies of block transitions during fuzzing [62, 157, 210]. For instance, if an edge 𝐴𝐵 has been examined by 30 times while its neighbor edge 𝐴𝐶 has been examined by 70 times, the transition probabilities of edges 𝐴𝐵 and 𝐴𝐶 are 0*.3 and 0.7, respectively (Fig.3a). The transition probabilities in a path 𝐴𝐵𝐷𝐸𝐺 can be obtained in the same way. Suppose that the probabilities for edges 𝐴𝐵, 𝐵𝐷, 𝐷𝐸 and 𝐸𝐺 are 0.3, 0.25, 0.6 and 0.8, respectively. Then, the fitness of a seed exercising the path 𝐴𝐵𝐷𝐸𝐺 can be calculated as 0.3 × 0.25 × 0.6 × 0.8 = 0.*036. At the beginning of fuzzing, the transition probabilities can be obtained by the Monte Carlo method [125], which examines basic blocks randomly. With the fitness calculated based on the Markov chain, fuzzers can guide the calculation of energy assignment [62, 157] or select the hardest path (lowest transition probability) to be solved by concolic execution [210]. Basically, the lower the transition probability, the better the fitness. The motivation is that the hard-to-reach code regions (low transition probabilities) require more energy to fuzz because the easy-to-reach regions can be easily well explored.The mutation-based fuzzing generates new inputs by mutating seeds, and each input exercises an execution path. This provides another perspective to model a fuzzing process based on the Markov chain. Specifically, the state is defined as an execution path exercised by an input; meanwhile, the state transition is the mutation of an input 𝑡𝑖 , which generates a new input 𝑡𝑗 . Correspondingly, the state transition is also the transition from a path 𝑖, which is exercised by the input 𝑡𝑖 , to another path 𝑗, which is exercised by the input 𝑡𝑗 . Similar to block transitions, the probabilities of path transitions are calculated based on previous executions during fuzzing. As deduced by AFLFast [23], the minimum energy required for a state 𝑖 to discover a new state 𝑗 is 1/𝑝𝑖𝑗 , where 𝑝𝑖𝑗 is the transition probability. Therefore, AFLFast assigns more energy to less frequent paths, i.e., the paths with low path transition probabilities. Variants of this model are investigated in directed greybox fuzzing (DGF) [22, 217] and regression greybox fuzzing (RGF) [214]3.2.3 Fitness by State Transition (Multi-Armed Bandit). Although the Markov chain has achieved great success in modeling state transitions, it is not profound enough to model the seed schedule of fuzzing. The Markov chain requires transition probabilities among all states so that fuzzing makes a proper decision. However, it is common that many states have not been examined during fuzzing, which indicates that the decisions based on the Markov chain are not optimal. For block transitions, one solution is to use the rule of three in statistics for the transitions from a discovered block to an undiscovered block [210]. For path transitions, a naive solution to obtain all transition probabilities is the Round-Robin schedule [156], which splits the time budget between seeds equally [158]. However, this solution cannot determine when to switch from Round-Robin to Markov chain.

The balance between traversing all seeds and focusing on a specific seed is a classic "exploration vs. exploitation" problem. A better solution for solving the "exploration vs. exploitation" problem is to formulate the path transitions as a Multi-Armed Bandit (MAB) problem [14]. For a Multi-Armed Bandit problem, a player intends to maximize the total rewards by observing the rewards of playing some trials on the arms of a slot machine. The exploration is the process when the player plays all arms to obtain their reward expectations. When reward expectations of all arms are known, the exploitation is the process when the player only selects arms with the highest reward expectations. In order to formulate the path transitions, a seed 𝑡𝑖 is defined as an arm. Meanwhile, the reward is the discovery of a new path exercised by an input, which is

generated from the seed 𝑡𝑖. Considering that the total number of seeds is increasing and the reward expectation of a seed is decreasing during fuzzing, EcoFuzz [203] proposes the Variant of the Adversarial Multi-Armed Bandit Model (VAMAB) to solve the problem. Specifically, EcoFuzz adaptively assigns energy for unfuzzed seeds, i.e., the exploration process. With the rewards of all seeds, EcoFuzz estimates the expectation of a seed 𝑡𝑖 as (1 − 𝑝𝑖𝑖/ √𝑖), where 𝑝𝑖𝑖 is the self-transition probability (i.e., mutation of seed 𝑡𝑖 results in exercising the same path 𝑖). The logic is that a low self-transition probability indicates that mutations of the seed can discover other paths (new paths) with high probability.

Therefore, EcoFuzz prefers seeds with low self-transition probabilities. Similarly, AFL-HIER [88] also formulates the path transitions as a MAB problem. While EcoFuzz uses single metrics (i.e., edge coverage) to retain new seeds, AFL-HIER proposes to utilize multi-level coverage metrics, such as functions, edges, and basic blocks, to add new seeds. AFL-HIER chooses UCB1 [9], one of the MAB algorithms, to solve the MAB problem with multi-level coverage metrics.

3.2.4 Fitness by State Discovery. Both Markov chain and MAB formulate the state transitions of programs. However, the essential objective of fuzzing is to discover new states, e.g., new code coverage, new crashes, or new bugs. This motivates Böhme et al. [17, 20, 21] to formulate fuzzing processes as a species discovery problem [32, 33, 52]. In a nutshell, ecologists collect numerous samples from the wild, and the species in the samples may be abundant or rare. Ecologists extrapolate the properties of a complete assemblage, including undiscovered species, based on the samples. Similarly, inputs generated by fuzzers are the collected samples, and the input space of a program is the assemblage. Fuzzing categorizes inputs into different species based on specific metrics. For example, an execution path can be a species, and all inputs exercising the path belong to this species. In this case, a rare species is an execution path that a few inputs exercise. An important hypothesis in species discovery is that the properties of undiscovered species can almost be explained only by discovered rare species [31]. This hypothesis implies that fuzzing can assign more energy to the rare species (e.g., rare paths) to discover new states. Based on the problem of species discovery, Entropic [21] understands fuzzing as a learning process, i.e., a fuzzer gradually learns more information about the program behaviors (species). Entropic proposes to use Shannon’s entropy [170] to measure the efficiency of species discovery.

The original Shannon’s entropy 𝐻 measures the average information of species and is calculated as 𝐻 =− Í 𝑖 𝑝𝑖 𝑙𝑜𝑔(𝑝𝑖), where 𝑝𝑖 is the probability of selecting the species S𝑖 . The entropy 𝐻 is large (more information) if the collected samples contain many species; otherwise, the entropy 𝐻 is small (less information) if the collected samples contain a few species. Meanwhile, Entropic argues that the rate of species discovery quantifies the efficiency of fuzzing processes. Deduced from Shannon’s theory, Entropic measures the efficiency of species discovery for a seed. Specifically, 𝑝𝑖𝑡 is the probability of mutating a seed 𝑡 and generating an input that belongs to species S𝑖 . The learning rate of the seed 𝑡 is calculated based on the probability 𝑝𝑖𝑡 and an improved entropy estimator. Entropic concludes that more energy is assigned to seeds with a larger learning rate, i.e., seeds that discover new species more efficiently are assigned more energy.

3.3 Byte Schedule

The byte schedule determines the frequencies of selecting a byte in a seed to be mutated. Most fuzzersselectbytesheuristicallybasedonexecutioninformation[7,37,38,67,101,103,157,160,187] or randomly [22, 23, 118, 203, 206]. Byte schedule requires a more sophisticated understanding of program behaviors, such as path constraints or dataflow, than seed schedule. Therefore, fuzzers focus on a less complex problem called the importance of bytes, which implies how the bytes influence fuzzing processes. Because most greybox fuzzers use edge coverage to test PUTs, the first approach is to define the importance as how the bytes influence branch behaviors. NEUZZ [172] and MTFuzz [171] model the relationships between input bytes and branch behaviors based on deep learning (DL) models. The gradients of DL models quantify the importance of bytes because the large gradient of a byte indicates that a small perturbation of the byte results in a significant difference in branch behavior. For later mutations, fuzzing will prioritize the bytes with higher importance to be mutated.

Another approach to quantify the importance of bytes is to define it based on the fitness of seeds. As analyzed in §3.2, the fitness of a seed reflects the quality of the seed. Therefore, fuzzing can focus on bytes that can improve the quality. AFLChurn [214] utilizes Ant Colony Optimization (ACO) [60] to learn how bytes influence the fitness of seeds. Similar to the process that an ant colony searches for food, if the change of a byte improves the fitness of a seed, fuzzing increases the score of the byte. When fuzzing continues the test, the scores of all bytes decrease over time, which reflects the pheromone evaporation of ACO. As a result, fuzzing prefers to select the bytes with higher scores.

3.4 Mutation Operator Schedule

As depicted in Fig.2, the final step of the input generator is to choose a mutation operator (mutator) to mutate the selected byte(s). The mutation schedule decides which mutator to be selected for the next mutation of bytes. The motivation for mutation schedule is based on the observation that the eficiency of mutators varies [42, 116]. Classfuzz [42] argues that the mutators that have explored more new states have a higher probability of being selected. Thus, classfuzz assumes that Markov Chain Monte Carlo (MCMC) can model the process of mutation schedule. Classfuzz adopts the Metropolis-Hastings algorithm [45], one of the MCMC methods, to solve the problem of mutation schedule. Specifically, each mutator has a success rate that quantifies the new states explored by the mutator. Classfuzz first randomly selects the next mutator, and then it accepts or rejects the selection based on the success rates of both the current mutators and the selected ones.

MOPT [116] utilizes Particle Swarm Optimization (PSO) [94] to model the process of mutation selection. PSO uses several particles and gradually moves them towards their best positions. As to the problem of mutation schedule, a mutator is a particle, and the position is the probability of selecting the mutator. A particle finds its best position when the particle yields the most number of new states. at a position among all the other positions. Therefore, all particles (mutators) asymptotically find their own best positions (probabilities), which constructs the probability distribution for selecting mutators. Later fuzzing campaigns will select mutators based on the probability distribution.

3.5 Diverse Information For Fitness

Besides the schedule of seeds, bytes, or mutators, fitness can also be used to guide seed retention. Fuzzers usually utilize a genetic algorithm (GA) to formulate the process of seed retention. Specif-ically, a fuzzer generates an input by mutating a seed, and if the input explores new execution states (i.e., better fitness), the input is retained as a new seed. When selecting a seed for the next round of test (based on seed schedule), fuzzing may choose the new seed. Most coverage-guided fuzzers [7, 23, 116, 177, 204, 206] retain seeds based on edge coverage. In order to improve the ability of defect discovery, more sensitive code coverage is necessary to reveal more information of execution states. On the other hand, new types of fitness are designed for some specific scenarios, such as deep learning models [148] or robotic vehicles [82]. Note that the diversity of information is utilized for both seed retention and the schedule problems mentioned before.

Fig. 3. Sensitivity and limitation of code coverage. Edge coverage is limited to discover more execution states, including program defects.

3.5.1 Sensitive Code Coverage. The sensitivity of fitness indicates the ability to differentiate execution states. Many coverage-guided fuzzers [7, 23, 116, 177, 204, 206] implement a bitmap to provide fuzzing with the knowledge of edge coverage. Essentially, the bitmap is a compact vector, where the index of each element indicates an edge identifier. They calculate hash values ℎ𝑎𝑠ℎ(𝑏𝑖 , 𝑏𝑗) for the edge identifiers, where 𝑏𝑖 and 𝑏𝑗 are block identifiers that are randomly assigned during instrumentation. Although this implementation is fast during execution, it sacrifices the precision of edge coverage. Specifically, this implementation of obtaining edge coverage incurs the problem of edge collision, i.e., two different edges are assigned with the same identifier [66, 216]. As shown in Fig.3a, if edge identifiers 𝑖𝑑𝐴𝐵 = 𝑖𝑑𝐴𝐶 and 𝑖𝑑𝐵𝐷 = 𝑖𝑑𝐶𝐷 , then the paths 𝐴𝐵𝐷 and 𝐴𝐶𝐷 are regarded as the same one. Since fuzzing does not discover new coverage in this scenario, it will neglect the bug path 𝐴𝐶𝐷𝐸𝐺. Therefore, in order to assign unique edge identifiers, fuzzing assigns block identifiers carefully and proposes more precise hash functions [66].

With the bitmap, fuzzing can determine whether an input exercises new edges and retain the input as a new seed if it does. Specifically, fuzzers maintain an overall bitmap, which is a union of bitmaps of individual executions. When determining new edges, fuzzing compares an individual bitmap with the overall bitmap to check if new edges exist in the individual bitmap. However, the union of bitmaps loses information of executions [118]. For example, if paths 𝐴𝐵𝐷𝐸𝐺 and 𝐴𝐶𝐷𝐹𝐺 in Fig.3a have been exercised, the input that exercises the new path 𝐴𝐶𝐷𝐸𝐺 will not be retained as a seed because all edges have already existed in the overall bitmap. Therefore, one solution is to research on combining individual bitmaps [118]. Because the combination of bitmaps will introduce too many seeds, a critical challenge is to balance between the efficiency of fuzzing and the sensitive coverage. One potential solution is to use dynamic Principal Component Analysis [191] to reduce the dimensionality of the dataset [118]. Other solutions to improve the sensitivity of edge coverage include path hash [198], calling context [37, 87, 171], multilevel coverage [88], and code complexity [105], which add extra information for the edge coverage.

The improvement of the bitmap focuses on searching for more code coverage. However, such improvement may fail to explore complex execution states. For instance, Fig.3b is the code snippet of a labyrinth, where the pair (a, b) indicates the position in the labyrinth. In order to trigger the bug(), the pair (a, b) has to be the one with specific values. However, the switch snippet has only four edges and can be quickly explored. After that, fuzzing has no guidance for reaching the bug location. Therefore, an effective solution is to introduce the human-in-the-loop approach for the guidance of exploring complex execution states [6]. In the example of Fig.3b, an analyst can add an annotation in the code so that fuzzing will explore different values of the pair (a, b). Besides, the value of a comparison condition is more sensitive than the binary results, i.e., ‘examined’ or ‘not examined’ [37, 38, 65, 98, 137]. For instance, if the value of (x[3] - 0x44) in Fig.3a can be known, a fuzzer can select seeds that are more likely to satisfy the condition if (x[3] == 0x44).

3.5.2 Diverse Fitness. Fitness is not limited to code coverage. In fact, code coverage is not always practical or the best feedback for fuzzing campaigns. Therefore, researchers utilize different types of feedback for different applications or defects. If code coverage is not available, an intuitive solution is to retain seeds based on execution outputs, such as the legality of execution results [141] or state machine of protocol implementations [154]. Because fuzzing can be utilized to detect defects of various applications, different types of fitness are designed for specific applications [1, 82, 148] or defects [194]. The following summarizes the diverse kinds of fitness.

• Legality of execution result

An object-oriented program (e.g., Java) consists of a sequence of method calls, and the execution result either is legal or throws exceptions. Fuzzing generates and obtains new method call sequences that can explore more new and legal object states [141].

•State machine of protocol implementations

.A state machine consists of states and inputs that transform its states [161]. Due to the complexity of protocols, fuzzers usually infer the state machine by gradually adding new states into the state machine [55, 61, 64, 69, 154]. The state machine starts from a seed (i.e., an initial state machine), and fuzzers mutate the current states of state machine to explore new states. The vulnerabilities are analyzed based on the state machine, which intends to search for vulnerable transitions [55].

•Safety policy of robotic vehicles

The safety policies are the requirements for a robotic vehicle’s physical or functional safety, such as the maximum temperature of a vehicle’s engine [82]. When an input is closer to violations of policies, the input is retained for later mutation.

• Fitness for deep learning systems

The fuzzing of deep learning systems (DLSs) aims to improve the robustness and reliability of them [68, 115, 148]. To achieve this, fuzzers design different types of fitness, such as neuron coverage for discovering corner cases [148], loss function for augmentingthe training data [68], or operator-level coverage for exploring deep learning inference engines (i.e., frameworks and libraries) [115].

• Validation log of Android SmartTVs

The validation logs are the messages of executing Android SmartTVs [1]. The validation logs are utilized to infer valid inputs and extract input boundaries. The valid inputs provide fuzzing with effective seeds and the input boundaries reduce the search space of inputs.

• Behavioral asymmetry of differential testing

For differential testing, bugs are discovered via observing the discrepancies in the behaviors of different implementations, which have the same functionality, on the same input. The behavioral asymmetries indicate how discrepant the various implementations are. Fuzzing aims to generate test cases that can discover more discrepancies [138, 151].

• Alias coverage for data race

The alias coverage is designed to detect data races in kernel file systems [194]. A data race is a concurrent bug in which two threads access a shared memory location without proper synchronization. Therefore, the alias coverage tracks pairs of memory accesses that may interleave with each other.

•Dangerous locations for bugs

The dangerous locations are the code regions that have higher probability to trigger a bug. Therefore, fuzzers can steer fuzzing resource towards those locations to improve the effectiveness and efficiency of fuzzing. For concurrency bugs, the dangerous locations usually refer to code regions that result in atomicity violations [97], data races [84, 167], or suspicious interleavings [35]. For non-concurrency bugs, dangerous locations may be obtained by patch testing [122], crash reproduction [155], static analysis report [48], or information flow detection [123]. Moreover, the dangerous locations may be the memory accesses [84, 182, 188], sanitizer checks [40, 140], or commit history [214].

3.6 Evaluation Theory

The evaluation of fuzzing is usually conducted separately from the detection stage. However, we consider the evaluation as a part of the fuzzing processes because a proper evaluation can help improve the performance of fuzzing [215]. A proper evaluation includes effective experimental corpus [215], fair evaluation environment/platform [30, 104, 126], reasonable fuzzing time [17, 20], and comprehensive comparison metrics [96, 104]. Although these researches have made efforts on proper evaluations, it is still an open question about how to evaluate techniques (i.e., the fuzzing algorithms) instead of implementations (i.e., the code that implements the algorithms) [18]. A widely-used solution is to evaluate fuzzers based on a statistical test, which offers a likelihood that reflects the differences among fuzzing techniques [96].

Gap 1: The fuzzing theories narrow down the gaps between input space and defect space. Fuzzing theories formulate fuzzing processes based on program behaviors, such as bug arrival, state transition, and state discovery. Most theories formulate the process of seed schedule. Almost all fuzzers formulate seed retention based on genetic algorithm.

4 SEARCH SPACE OF INPUTS

As discussed in §3, fuzzers utilize optimization solutions to solve the search problem of input generation, which optimizes the search process in the input space. If a fuzzer can reduce the input space, it will also improve the performance of fuzzing. To achieve that, fuzzers group the related bytes in an input and apply specific mutators to each group. Suppose that an input includes 𝑎 × 𝑏 bytes and is equally divided into 𝑎 parts, then instead of 256𝑎×𝑏 , the search space of fuzzing is 𝑎 × 256𝑏 when resolving a specific path constraint. The related bytes can be the ones constructing

Fig. 4. Search space of input. The input space can be reduced by grouping related bytes. The relationships between bytes can be obtained based on path constraints or input specifications.

the same data structure [16, 201], influencing the same path constraint [37, 38, 65, 67, 157, 160, 187], or conforming to the same part of a grammar [78, 115, 120, 136, 181, 197, 212]. The mutators include byte mutation (e.g., bitflip, byte deletion, and byte insertion) [23, 206] and chunk mutation (e.g., chunk replacement, chunk deletion, and chunk insertion) [74, 77, 78, 181, 197]. As shown in Fig.4a, the whole input space can be divided into three parts, each of which is related to the variables 𝑖, 𝑘 and array 𝑎[], respectively. On the other hand, when solving a path constraint, a method focusing on the related bytes also reduces the search space.

For example, when solving the path constraint of line 14 in Fig.4a, the search space is only one byte if the constraint of line 13 satisfies. A special kind of input is the highly-structured input, which is utilized for applications such as protocol implementations, Document Object Model (DOM) engines, and compilers. As shown in Fig.4b, the cJSON parser requires that the segments of an input start with some specific characters. If an input violates the requirement, the input is not allowed to examine the functionalities guarded by the parser. Table 2 describes the approaches that reduce the search space and the relations utilized for grouping input bytes. The fuzzers in Table 2 are selected because their main contributions are the reduction of input space.

4.1 Byte-constraint Relation

For most path constraints, they are influenced by only a small part of an input. If a fuzzer only mutates the related bytes, the performance of fuzzing can be significantly improved by reducing the search space of inputs. For instance, fuzzing may generate 25611 inputs to satisfy the condition if (a[2] == 33) in Fig.4a with a random mutation. However, if 𝑝𝑜𝑠[10] is known to be the only byte that influences the value of 𝑎 [2], we can only mutate the byte 𝑝𝑜𝑠[10] to pass the condition.

As a result, only 256 inputs at most are required to resolve the path constraint. After obtaining the byte-constraint relation, a naive mutation scheme is to mutate the related bytes randomly [67, 157, 187]. A more uniform approach is to set the values of a byte from 0 to 255, respectively [172]. However, these two solutions are ineffective because they have no knowledge Table 2. Input space. Fuzzers reduce the input space by grouping the related bytes. The groups of bytes are obtained based on certain relation.

about the quality of inputs. If the inferring process of byte relation can obtain values of comparison instructions in a program, fuzzing can mutate related bytes and select inputs that make progress in passing path constraints. An input makes progress if it matches more bytes of a path constraint [65, 103]. Moreover, fuzzing can utilize a gradient descent algorithm to mutate related bytes and gradually solve path constraints [37, 38].

4.1.1 Dynamic Taint Analysis. Dynamic taint analysis (DTA) [51, 135] is a common technique for building the relationships between input bytes and path constraints. DTA marks certain data in inputs and propagates the labels during executions. If a variable in the program obtains a label, thevariable is connected to the data with the label [51]. Fuzzers [37, 38, 67, 157, 160, 187] utilizes DTA to build relationships between input bytes and security-sensitive points (e.g., system/library calls or conditional jumps).

4.1.2 Relation Inference. DTA requires heavy manual efforts and can also result in inaccurate relation due to the implicit data flows [65]. Because fuzzing examines target programs with numerous test cases, a light-weight solution is to infer the byte relation at runtime. One solution is to observe if the mutation of a byte changes the values of a variable [65], the results of a comparison instruction [7, 103, 200], or the hits of a branch [101]. If it does, the byte is linked to the variable, the comparison instruction, or the branch, respectively. Another approach for inference is to approximately build connections between input bytes and branch behaviors based on deep learning [172].

4.2 Concolic Execution

Concolic execution (also known as dynamic symbolic execution) regards the program variables as symbolic variables, tracks path constraints, and uses constraint solvers to generate a concrete input for a specific path [165]. In other words, concolic execution reduces the search space via directly solving the path constraints. Techniques utilizing both symbolic execution and fuzzing are called hybrid fuzzing or whitebox fuzzing. Hybrid fuzzing [72, 73, 153] utilizes fuzzing to exercise execution paths in target programs and uses concolic execution to solve constraints in those execution paths. Fuzzing suffers from solving path constraints due to its random nature. Although concolic execution is effective in solving path constraints, it is time-consuming to apply concolic execution for every execution path. Therefore, when fuzzing can no longer explore more states, concolic execution is utilized to solve constraints that fuzzing cannot satisfy [177].

One improvement of the hybrid fuzzing is to prioritize the most dificult path for concolic execution to solve [210]. In addition to the path selection, the performance of hybrid fuzzing can be improved via developing approximate constraint solvers. Generally, the satisfiability modulo theory (SMT) solvers (e.g., MathSAT5 [49] or Z3 [54]) are utilized to solve the path constraints. However, the SMT solvers have challenges in solving constraints due to the complex constraints or path explosion [177]. To mitigate the challenge, the constraint solver only symbolizes the path constraints that are influenced by inputs [46, 204]. Another improvement is based on the observation that many path constraints tend to be linear or monotonic [47]. Thus, the constraint solver performs in a greybox manner, e.g., it utilizes linear functions to approximate constraint behaviors. Interestingly, researchers have paid attention to solving path constraints via fuzzing [24, 108]. For example, JFS [108] translates SMT formulas to a program and utilizes coverage-guided fuzzing to explore the program. The SMT formulas are solved when a fuzzing-generated input reaches specific locations of the corresponding program.

Constraint solvers can also be improved based on features of targets. In terms of nested conditions (e.g., lines 13-15 in Fig.4a), Pangolin [81] proposes polyhedral path abstraction to resolve nested path constraints. The polyhedral path abstraction retains the solution space of historical constraints and reuses the solution space to satisfy the reachability for the current path constraints. For example, in order to solve the constraint in line 14 of Fig.4a, the input has to satisfy conditions in line 13 first. In order to utilize hybrid fuzzing in programs that require highly-structured inputs, Godefroid et al. [71] first symbolize tokens in grammar as symbolic variables. Then, it uses a context-free constraint solver to generate new inputs.

4.3 Program Transformation

For fuzzing, the program transformation aims to remove sanity checks that prevent fuzzing from discovering more execution states. By removing those checks, fuzzing can explore deep code intarget programs and expose potential bugs [150]. The removal introduces many false positives of bug locations, which can be further verified by symbolic execution. Thus, program transformation reduces the search space by focusing on inputs that may potentially trigger bugs.

4.4 Input Model

Many applications require highly-structured inputs, such as protocol implementations [3], Docu-ment Object Model (DOM) engines [197], JavaScript engines [78], PDF readers [74], system calls [77], and compilers [43]. An input model specifies the rules of constructing a highly-structured input, including the structure, format, and data constraints of inputs. In order to generate inputs that satisfy the specifications, the generation process is restricted to specific operations. If an input violates the syntax or semantics of target programs, the input will be rejected by the program at an early stage. In other words, the input space is subjected to the input model.

4.4.1 Accessible Models or Tools. Many fuzzers generate valid input files based on accessible input models [3, 56, 57, 100, 115, 136, 147, 181] or existing tools [75, 153]. Input generation based on bare specifications requires heavy engineering efforts. Moreover, it is error-prone because the specifications are complex to parse. The cJSON parser of Fig.4b is an implementation of JSON specification, which is error-prone, although simple, due to the complex parse of different data types. Therefore, the research community has open-sourced some tools for highly-structured inputs, e.g., QuickCheck[50] and ANTLR[146]. For example, NAUTILUS [5] and Superion [186] generate new inputs based on ANTLR. Then, both NAUTILUS and Superion use code coverage to optimize the mutation process. In some scenarios, the input model can simply be the type that the generated data will conform to (e.g., types of API arguments or physical signals) [1, 41, 70, 179]. For instance, the data for actuators of cyber-physical systems (CPSs) could be the binary values on or off [41].

4.4.2 Integration of Implementations. Another promising approach is to integrate fuzzing with the implementations of target applications [64, 89, 173]. Such integration allows fuzzing to examine the desired properties of target applications by customizing the process of input generation. For example, TLS-Attacker [173] creates a framework that can mutate inputs based on the type of each segment, and manipulate the order of protocol messages. This framework includes a complete Transport Layer Security (TLS) protocol implementation.

4.4.3 Intermediate Representation. A more complex approach is to convert the input model into an intermediate representation (IR). For highly-structured inputs, mutations on the original input files are too complex to maintain the syntax and semantics. Therefore, researchers translate the original files into IR, which is simpler and more unified. Fuzzers mutate the IR and then translate the mutated IR back to the original input formats. This mutation strategy can maintain syntactic or semantic correctness and generate more diverse inputs. For instance, IR is utilized to test database management systems (DBMSs) [212], examine DOM engines [197], or fuzz different language processors (e.g., compilers or interpreters) [43].

4.5 Fragment Recombination

Based on input specifications, another solution for input generation is to generate new input files via fragment recombination. The basic idea of fragment recombination is to separate input files into many small chunks (i.e., fragments), and then generate a new input file via combining small chunks from different input files. Each fragment conforms to the specifications of inputs so that the recombined input files are syntactically correct. Ideally, the reassembled input file will exercise a new execution path or expose a new bug.

Fig. 5. Fragment Recombination. Input files are often parsed into trees (e.g., ASTs), and the fragments in the trees will be recombined for generating a new input file. Each fragment conforms to the specifications of inputs.

Fig. 6. Semantic error (JavaScript). Line 2-5 intro-duces semantic error because the method errf() is not defined.

As depicted in Fig.5, fuzzers first parse an input file into a tree (e.g., abstract syntax tree (AST)), which remains the syntactic correctness. In order to properly parse inputs, the input corpus is required to be valid [78, 180, 185, 199]. One way to gather a valid input corpus is to download files from the Internet [26]. Besides validity, fuzzers also collect problematic inputs, which have caused invalid behaviors before, for the input corpus [80, 99, 145]. The basic assumption is that a new bug may still exist in or near the locations where an input has already discovered a bug [80]. The problematic inputs have already exercised complex execution paths that cause invalid behaviors. Therefore, the recombination of fragments may exercise the same or neighbor complex paths, which helps fuzzing explore deep code lines. In the second stage, the input files are separated into many fragments, which are stored in a fragment pool. Because fuzzers parse inputs into ASTs, the fragments can be sub-trees that contain non-terminals. When recombining fragments, a newly generated input file is required to be syntactically correct. Therefore, fuzzers recombine syntactically compatible fragments based on random selection [80, 120, 199], genetic algorithm [180], or machine learning [185]. In addition to syntactic correctness, semantic correctness also has a significant influence on the effectiveness of fuzzing. For instance, in order to generate syntactically and semantically correct JavaScript inputs, CodeAlchemist [78] tags fragments with assembly constraints. That is, different fragments are combined only when the constraints are satisfied.

4.6 Format Inference

If input models are not accessible, inferring the format of inputs is a promising solution to generate valid inputs. Moreover, one input model can only generate inputs with a specific format. In order to support more formats of inputs, developers have to utilize new input models and select the corresponding input model when generating inputs. Therefore, format inference is more scalable than model-based approaches.

4.6.1 Corpus-based. In order to infer the formats, a straightforward approach is to learn from a valid input corpus. Due to the lack of input models, researchers build end-to-end deep learning models to surrogate the input models. Recurrent neural network (RNN) [127] is a preferable deep learning model for fuzzers to generate structured inputs [74, 79, 111, 112]. However, the surrogate solution may suffer from generating invalid inputs. For example, the highest rate for DeepFuzz [112] to generate valid syntax inputs is only 82.63%. To improve the rate of generating valid inputs, the training data needs to be refined accordingly. For example, when generating PDF files, the training data is composed of sequences of PDF objects instead of textual data [74]. As for smart contracts,the training data is about the sequences of transactions [79]. Similarly, LipFuzzer [208] trains adversarial linguistic models to generate voice commands, where the training data is presented through a linguistic structure. Besides, fuzzing can synthesize a context-free grammar (e.g., regular properties such as repetitions and alternations) based on a valid input corpus [12]. The synthesized grammar is then utilized for generating highly-structured inputs.

4.6.2 Coverage-based. The corpus-based solution requires the training data to have comprehensive coverage of input specification, which may not be practical [200]. Besides, it does not use knowledge from internal execution states (e.g., code coverage), which may cause low code coverage. Essentially, the format of inputs indicates the relationships among different bytes in inputs. Therefore, based on the code coverage, fuzzers infer the byte-to-byte relations to boost fuzzing [16, 201]. For instance, GRIMOIRE [16] uses code coverage to infer the format required by target programs. It aims to identify the format boundary of an input. Specifically, it changes some bytes in an input and checks whether the changes result in different code coverage. If the code coverage remains the same, the positions where the bytes are mutated can be randomly mutated. Otherwise, the positions are required to be carefully mutated. ProFuzzer [201] first defines six data types that cover most input content. Then, based on the distribution of edge coverage, it infers the types of each byte and merges the consecutive bytes that belong to the same type.

4.6.3 Encoding Function. Different from all the aforementioned approaches that focus on inputs, some fuzzers search for the code regions that encode the formats of inputs [36, 92, 159]. Because such code regions correlate to the generation of well-structured inputs, fuzzers perform mutation before encoding the formats. Although the source code of PUTs may not be accessible, their corresponding implementations that generate well-structured inputs are often accessible [92]. For example, the community open-sources some tools for generating highly-structured inputs [50, 146] (§4.4.1). As for the Internet of Things (IoT) devices, most of them are controlled through companion applications, which form messages communicating with target devices [36]. By locating the code regions related to encoding formats, mutations can be operated on the arguments of functions [36, 159] or the instructions calculating the formats [92]. For instance, IoTFuzzer [36] hooks such functions and mutates the data for arguments of those functions.

4.7 Dependency Inference

The format inference mainly intends to satisfy the syntactic requirements, which may still generate inputs with incorrect data dependency. For instance, in the code snippet 2 of Fig.6, an error occurs in the line 2-5 because the method errf() is not defined. Many applications require correct data dependency in inputs, which usually consist of sequences of statements. The sequences include systemcallsforkernelcode[77,95,142],objectsforprocessorsofobject-orientedprograms[59,117], application programming interfaces (APIs) for services/libraries [8, 83, 110], or application binary interfaces (ABIs) for smart contracts [86]. On the one hand, most of these applications require definition/declaration before using the data in inputs, as depicted in Fig.6. On the other hand, the outputs of executing some statements are the data of arguments for some other statements.

4.7.1 Documents or Source Code. The data dependency of sequences is usually inferred via static analysis. Because many applications have documents or source code describing their interfaces, researchers infer the data dependency based on those resources [8, 53, 59, 86, 110, 117]. The resources contain information about how to use an interface and the prerequisites for the interface. When fuzzing generates inputs including an interface, fuzzing is also required to generate the prerequisites for the interface. Otherwise, the generated input will be refused at an early stage. However, the static analysis introduces high false positives and misses dependencies of interfaces.

Therefore, when having access to the source code of PUTs, a better solution is to combine static analysis and dynamic analysis [95].

4.7.2 Real-world Programs. Many real-world programs implement code lines to call interfaces, which have already considered the data dependency of interfaces. Therefore, fuzzing can synthesize new programs that call the interfaces based on program slicing of the real-world programs [10]. Besides, the data dependency can be inferred by analyzing execution logs of those real-world programs [77, 83, 142]. The execution logs explicitly contain the ordering information of interfaces, i.e., the information about which interface is executed first. Moreover, the execution logs implicitly contain the information of argument dependency between interfaces. In order to obtain explicit and implicit information, fuzzing hooks each interface when executing a real-world program and records desired information.

Gap 2: The reduction of input space relies on grouping the input bytes that are syntactically and/or semantically related. The advantage of grouping bytes is the improvement of eficiency in exploring more execution states. That is, fuzzing is more likely to satisfy path constraints so that it explores deep code regions guarded by these constraints.

5 AUTOMATION

Automatic execution is the basis of applying fuzzing theory and input space reduction approaches. Fuzzing repeatedly executes PUTs and monitors executions for exceptions, and the exceptions are further verified about whether they are bugs or not. Therefore, in order to succeed in fuzzing, the first requirement is to run PUTs automatically and repeatedly. Most fuzzers have already succeeded in testing command-line software, but they cannot be directly utilized for other targets, such as hardware or polyglot software [18]. The second requirement is the automatic indicator of potential bugs. Currently, fuzzing uses crashes as a sign of potential bugs. Yet, many bugs are not manifested as crashes, such as data races. Because the nature of fuzzing is to test targets repeatedly, the third requirement is the high execution speed of fuzzing. A higher execution speed indicates that more test cases are examined in the same time budget, which offers higher opportunity to find defects.

5.1 Automatic Execution of PUTs

Fuzzing is utilized for diverse applications, which require different engineering efforts to automatize the fuzzing processes. This section introduces several kinds of applications where fuzzing has successfully automatized the testing.

5.1.1 Command-line Programs. Fuzzing has achieved a great success in testing command-line programs (e.g., the “general” applications in Tables 1 and 2). It runs PUTs in a sub-process and feeds the PUTs with the required options and inputs [23, 206, 214]. In order to improve the execution speed, fuzzing does not replay all steps for executing a PUT. Instead, it clones a child process so that it skips pre-processing steps, such as loading the program file into memory [206, 216]. Usually, fuzzing takes only one command option for all fuzzing campaigns, i.e., all generated inputs are executed based on one option. Because different options indicate different code coverage, a thorough test needs to enumerate all command options during fuzzing. An eficient scheme is that, if an input is invalid for one option, fuzzing skips testing all the rest options [176]. An important observation for this scheme is that if an input is invalid for one option, the input will fail all the other options.

5.1.2 Deep Learning Systems. The fuzzing of deep learning systems (DLSs) is similar to the test of command-line programs. DLSs are tested by fuzzing-generated inputs, and fuzzing intends togenerate inputs for better fitness [68, 115, 148]. The inputs are training data, testing data, or even deep learning models based on different targets. On the other hand, the fitness can be neuron coverage, loss function, or operator-level coverage, as described in §3.5.2. As for testing DLSs, fuzzing will not only detect defects [115] but also examine the robustness of models [68, 148].

5.1.3 Operating System Kernels. The Operating System (OS) kernels are complex compared to general command-line programs. Kernels include many interrupts and kernel threads, resulting in non-deterministic execution states. In order to fuzz the kernels in the way as command-line programs, fuzzing utilizes a hypervisor (e.g., QEMU or KVM) to run a target kernel [143, 164]. Meanwhile, the code coverage is obtained via Intel’s Processor Trace (PT) technology. Although this approach can fuzz various target kernels with feedback, it still requires manually constructing syntactically and semantically correct inputs. Because the inputs for the kernels include file system images or a sequence of syscalls, fuzzers can test kernels in a more light-weighted manner. That is, after the data dependency of syscalls are analyzed or inferred (§4.7), fuzzers generate sequences of syscalls to run on target kernels [53, 77, 95, 142, 183, 194, 196]. Then, fuzzers monitor whether sequences of syscalls incur system panics that indicate potential bugs in the target kernel. Another way to fuzz OS kernels is to emulate the external devices. Because the kernel communicates with the emulated devices, fuzzers can generate inputs to test the drivers in kernels [149].

5.1.4 Cyber-Physical Systems. Cyber-Physical Systems (CPSs) include two major components that are tightly integrated, i.e., the computational elements and the physical processes [41]. A widely-used computational element is the Programmable Logic Controller (PLC), which controls actuators to manage the physical processes and receives inputs from sensors. Thus, when fuzzing CPSs, fuzzers can replace PLCs and directly send numerous commands to actuators through the network [41]. Another way to fuzz CPSs is to examine the control applications and the runtime of PLCs [179]. However, the PLC binaries cannot be fuzzed in the same way as fuzzing command-line programs. Because PLC applications have various binary formats and complex communication with physical components, the automation of these applications varies. Based on the analysis of PLC binaries and their development platforms (e.g., Codesys), it is possible to automatically fuzz PLC binaries when running them on PLC devices [179].

5.1.5 Internet of Things. The automation of fuzzing Internet of Things (IoT) devices includes emulation [34, 205, 211] and network-level test [36, 63, 159]. The emulators [34, 205] can execute programs, which originally run on IoT firmware, without the corresponding hardware. With the help of emulators, fuzzers run target programs in a greybox manner [211]. On the other hand, the network-level fuzzing examines IoT devices in a blackbox manner. Because IoT devices can communicate with the outside world through networks, fuzzers automatically send messages (requests) to the IoT devices and wait for the execution results from IoT devices (responses) [36, 63, 159]. By categorizing the responses, the fitness is the number of categories, i.e., the purpose is to explore more categories [63].

5.1.6 Applications with Graphical User Interface. The execution speed of applications with Graphi-cal User Interface (GUI) is much slower than command-line programs [91]. Since the execution speed is one of the keys for the success of fuzzing, the automation of GUI applications often replaces the GUI with a faster approach and executes targets in the command-line manner [91, 106, 121]. For instance, fuzzers can model the interactions of user interfaces so that they generate event sequences for Android applications [106, 121]. Moreover, fuzzers can also utilize a harness, which prepares the execution context, to directly invoke target functions in GUIs [91].

5.1.7 Applications with Network. Some applications receive inputs (messages) through the network, such as smart contracts [79, 86, 137, 193], protocol implementations [55, 61, 64, 69, 154], cloud services [8], Android Native System Services [1, 110], or robotic vehicles [82]. Therefore, an input can be generated locally, and the execution of the target applications can be performed remotely. The eficiency of the automatic testing relies on the quality of generated inputs as well as the fitness that reflects the execution states. For instance, the inputs for smart contracts are sequences of contract transactions, i.e., messages between different accounts. When receiving the transactions, functions in smart contracts are executed on their infrastructure of the blockchain [79, 86, 137, 193].

5.2 Automatic Detection of Bugs

Many security bugs can be exploited to control over systems, leak private data, or break down servers [131]. The challenge to detect bugs is that bugs are unpredictable. Specifically, a detector does not know the bug locations, and even does not know whether a bug exists in a target program before testing it. Therefore, it is critical to record a potential bug during fuzzing automatically. Usually, the indicators are crashes of program execution but some other indicators are designed based on bug patterns. Although bug patterns can only be utilized to expose specific types of bugs, it is highly effective if such types of bugs exist in target programs [143]. This section mainly introduces six types of bugs that are successfully detected by fuzzing. They are memory-violation bugs, concurrency bugs, algorithm complexity, Spectre-type bugs, side-channels, and integer bugs.

Fig. 7. Memory violation bugs (adapted from [175]).

5.2.1 Memory-violation Bugs. Memory-violation bugs are among the oldest and most severe security bugs [131, 175, 178]. A program is memory safe if pointers are restricted to accessing intended referents. Memory safety violations can be classified into two categories, i.e., spatial safety violations and temporal safety violations [175]. A spatial safety violation gets access to memory that is out of bounds, while a temporal safety violation accesses an invalid referent. For instance, Fig.7a is a spatial violation because the size of buf is larger than str (buffer overflow). On the other hand, Fig.7b is a temporal violation because the pointer p is used after it has been freed (use-after-free). Although many approaches have been proposed to mitigate the influence of memory violations, most of them are not used in practice due to the disadvantages such as performance overhead, low compatibility, and low robustness [178].

Buffer overflow is one kind of memory-violation bugs and writes bytes out of bounds (e.g., example of Fig.7a). Dowser [76] argues that typical buffer overflows are caused mainly by accessing an array in a loop. In order to detect buffer overflows in loops, Dowser ranks instructions that access buffers in loops, and prioritizes inputs that exercise higher ranking accesses. It then uses taint analysis and concolic execution to solve path constraints for the selected inputs. Since Dowser focuses on arrays in loops, only a small number of instructions are required to be instrumented. Such focus improves the execution speed of both taint analysis and concolic execution. The use-after-free (UaF) bug is another type of memory violation, i.e., temporal safety violation. As shown in Fig.7b, an UaF bug consists of at least three statements: 1) allocating memory to a pointer, 2) releasing the corresponding memory, and 3) reusing the pointer. This bug pattern motivates UAFL [184] to generate inputs that can gradually cover an entire sequence of a potential UaF. The potential UaF sequences are obtained by static typestate analysis based on the bug pattern.

Fig. 8. Non-deadlock concurrency bugs (adapted from [114]).

5.2.2 Concurrency Bugs. Another severe security bug is the concurrency bug, which occurs when concurrent programs run without proper synchronization or order. Generally, concurrency bugs are categorized into deadlock bugs and non-deadlock bugs [114]. A deadlock bug occurs when operations in a program wait for each other to release resources (e.g., locks). The non-deadlock concurrency bugs mainly include atomicity-violation bugs and order-violation bugs [114]. An atomicity-violation bug violates the desired serializability of a certain code region. For example, in Thread 1 of Fig.8a, line 3 is required to be executed after line 1 so that fputs() is called with a valid argument. However, line 2 in Thread 2 is executed before line 3 in Thread 1. Thus, p->info is set to NULL before fputs() is called, which incurs an error. On the other hand, an order-violation bug occurs when two (or more) memory locations are accessed with an incorrect order. For instance, in Fig.8b, mThd->State in Thread 2 is executed before the initialization of mThd in Thread 1, which incurs an error called the use of uninitialized variables. Concurrency bugs can also result in memory violations, such as use-after-free and double-free [28].

One solution to discover deadlocks is to detect cycles on a lock order graph, of which each node represents a lock [4]. If a cycle exists in the graph, a potential deadlock is detected [90]. In order to improve the efficiency and scalability of the cycle detection, MagicFuzzer [27] iteratively removes locks in the graph if those locks are not included in any cycle. MagicFuzzer then checks the remaining cycles based on a random scheduler. As for atomicity violations, ATOMFUZZER [144] observes a typical bug pattern that a lock inside an atomic block is repeatedly acquired and released by two threads. Specifically, if a thread 𝑡 inside an atomic block is about to acquire a lock 𝐿 that has been acquired and released by 𝑡 before, ATOMFUZZER delays the execution of thread 𝑡 and waits for another thread 𝑡′ to acquire the lock 𝐿. It is an atomicity violation when the other thread 𝑡′ acquires the lock 𝐿 during the delay of thread 𝑡.

More generally, concurrency bugs occur due to the incorrect interleavings of threads. The challenge is that concurrent programs may have too many interleavings to examine each one (i.e., the problem of state-explosion). CalFuzzer [166] mitigates the state-explosion based on the fact that some interleavings are equivalent because they are from different execution orders of non-interacting instructions. The equivalence indicates that executions of them will result in the same state. CalFuzzer randomly selects a set of threads whose following instructions do not interact with each other and executes those instructions simultaneously. Therefore, CalFuzzer [166] can examine different interleavings more eficiently.

Fig. 9. Algorithm complexity (adapted from [152]).

Fig. 10. Spectre-type bug (adapted from [139]).

5.2.3 Algorithmic Complexity. Algorithm complexity (AC) vulnerabilities occur when the worst-case complexity of an algorithm significantly reduces the performance, which could result in Denial-of-Service (DoS) attacks. Fig.9 shows an example that when given different inputs to the argument array, the algorithm has different complexities. For example, if the value of array is [8,5,3,7,9], the algorithm will execute 37 lines of code (LOC). On the other hand, if array is [1,5,6,7,9], it will cause the execution of 67 LOC. The increase of LOC requires more computing resources. Thus, the worst-case behavior could be exploited by attackers to launch a DoS attack. Therefore, SlowFuzz [152] detects AC bugs via guiding fuzzing towards executions that increase the number of executed instructions. Similarly, HotFuzz [15] detects AC bugs in Java methods via maximizing the resource consumption of individual methods. MemLock [44] detects AC bugs based on both metrics of edge coverage and memory consumption. It steers fuzzing towards inputs that can either discover more edges or consume more memories. The aforementioned fuzzers directly generate the worst performance inputs (WPIs) for AC bug discovery. On the contrary, Singularity [190] synthesizes programs for input generation based on the observation that these WPIs always follow a specific pattern.

5.2.4 Spectre-type Bugs. A Spectre-type bug is a microarchitectural attack that exploits mispre-dicted branch speculations to control memory accesses [139]. For instance, in Fig.10, an attacker can send several in-bound values for the variable input, which will train the branch predictor to speculate if the check in line 2 is always true. When the attacker sends an out-of-bound value for input, the predictor will incorrectly predict the branch behavior, and lines 3-4 are speculatively executed (i.e., they are executed without the check in line 2). Since the input actually does not satisfy the check in line 2, the execution of lines 3-4 results in buffer overread. Therefore, SpecFuzz [139] instruments target programs to simulate the speculative execution, which can forcefully execute the mispredicted code paths. Then, any invalid memory access in the mispredicted paths can be triggered.

5.2.5 Side channels. Side-channel bugs leak secret information via observing non-functional behaviors of a system (e.g., execution time). For example, if a secret is the variable a in the statement ‘𝑖 𝑓 (𝑎 > 0){...}𝑒𝑙𝑠𝑒{...}’, one can observe the execution time of the then-branch and else-branch to tell whether the value of a is larger than zero. A special kind of side channels is called JITinduced side channels, which is caused by Just-In-Time (JIT) optimization [25]. Similar to the aforementioned Spectre-type bugs, one can repeatedly run programs to train the JIT compiler to optimize the execution time of either the then-branch or the else-branch. Then, the execution time of the trained branch (e.g., the then-branch) and the untrained branch (e.g., the else-branch) will be biased enough to be observable. As a result, the secret value of the variable a is leaked.

5.2.6 Integer Bugs. Integer overflow/underflow bugs occur when the value of an arithmetic expres-sion is out of the range that is determined by a machine type. On the other hand, integer conversion bugs occur when incorrectly converting one integer type to another integer type. In order to detect integer bugs, SmartFuzz [132] adds specific constraints according to different integer bugs into the symbolic emulation. Then, the symbolic solver intends to generate concrete inputs that may trigger integer bugs.

5.3 Improvement of Execution Speed

Execution speed is critical to fuzzing as fuzzing runs numerous test cases in a limited time budget. A higher execution speed means that fuzzing can examine more test cases, which offers a higher opportunity to find defects. Therefore, researchers put many efforts to improve the execution speed of fuzzing, including binary analysis [58, 134], optimized execution processes [46, 133, 204, 216], and application-specified techniques [91, 162, 163, 174, 195, 196, 211].

5.3.1 Binary Analysis. As a pre-process, fuzzing mainly utilizes static instrumentation to obtain executionstatesbecausestaticinstrumentationprovidesfuzzingwithhighexecutionspeed[58,134]. A widely used static analysis tool is LLVM [113], which instruments programs during compilation. When it comes to closed-source applications, fuzzers are restricted to binary analysis [124] because source code is not available. The problem is that binary instrumentation tools (e.g., Dyninst [128]), which succeed in many domains (e.g., emulation), suffer from run-time overhead when applied to fuzzing. To improve execution speed and provide fuzzing with compiler-level performance, RetroWrite[58]proposestousestaticbinaryrewritingtechniquesbasedonreassembleableassembly. It focuses on instrumenting binaries of 64-bit position independent code (PIC) via utilizing PIC’s relocation information to instrument assembler files. The performance overhead is reduced because RetroWrite can instrument inlined code snippets. Although fast, RetroWrite only supports 64-bit PIC binaries. In order to uphold both low run-time overhead and scalability, FIBRE [134] streamlines instrumentation through four IR-modifying phases. The four phases instrument programs via static rewriting, inlining, tracking register liveness, and considering various binary formats. The aforementioned rewriting techniques only rewrite binaries for once, which may result in unsound binary rewriting, especially for stripped binaries [189]. To solve this problem, STOCHFUZZ [209] proposes the incremental and stochastic rewriting techniques based on the fact that fuzzing executes target programs repetitively. Specifically, STOCHFUZZ rewrites target binaries for multiple times, and it gradually fixes problems introduced by previous rewriting results.

5.3.2 Execution Process. Execution speed can also be improved during the fuzzing campaign. UnTracer [133] observes that most test cases generated during fuzzing do not discover new coverage. This indicates that tracing all test cases, which AFL uses, incurs significant run-time overhead. Therefore, UnTracer only traces the coverage-increasing test cases to improve execution speed. This is accomplished by inserting interrupts at the beginning of basic blocks. When a block is examined, UnTracer removes the instrumentation at the block so that the execution will not be interrupted at the block in the future. Since block coverage loses information of execution states, CSI-Fuzz [216] utilizes edge coverage to improve UnTracer. Besides, Zeror [213] improves UnTracer via adaptively switching between Untracer-instrumented binary and AFL-instrumented binary. For hybrid fuzzing, concolic execution is utilized to resolve path constraints. Yet, the symbolic emulation in concolic execution is slow in formulating path constraints, which is the main factor that hybrid fuzzing suffers from scaling to real-world applications. QSYM [204] mitigates the performance bottleneck via removing some time-consuming components, such as IR (intermediate representation) translation and snapshot. Moreover, it collects and solves only a portion of path constraints. Although the concrete inputs generated by QSYM may not be the exact solution for path constraints, QSYM uses fuzzing to search for valid inputs via mutating those concrete inputs. Intriguer [46] observes that QSYM still suffers from performance bottleneck because QSYM solves many unnecessary constraints. Intriguer then performs symbolic emulation for more relevant instructions, which is determined by dynamic taint analysis. In addition to instrumentation and hybrid fuzzing, another optimization is to improve execution speed in the parallel mode. Xu et al. [195] observe that AFL [206] significantly slows down when it runs on 120 cores in parallel. This motivates them to design new operating primitives to improve the execution speed.

5.3.3 Various Applications. Besides general applications, fuzzing is also used to detect defects in targets from different fields, such as Internet of Things (IoT), operating system (OS) kernels, and Virtual machine monitors (VMMs). Since these targets usually have special features, fuzzing is customized for the targets to perform testing in an eficient manner.

Although emulation is a promising approach to fuzz IoT firmware, full-system emulation suffers from low throughput. The run-time overhead of full-system emulation mainly comes from translat-ing virtual addresses for memory accesses and emulating system calls. FIRM-AFL [211] mitigates the overhead via combining user-mode emulation and full-system emulation, and it mainly runs programs in the user-mode emulation. In order to fuzz VMMs (i.e., hypervisors), Schumilo et al. [162, 163] design a customized OS and a fast snapshot restoration mechanism to conduct fuzzing eficiently. As to the file systems, mutating a whole disk image degrades the fuzzing throughput significantly because an image is too large. To solve this challenge, JANUS [196] only mutates the metadata of a seed image, i.e., it exploits the properties of structured data. This solution re-duces the search space of inputs, resulting in improvement of throughput. OS kernels can also be compromised through a peripheral device, i.e., vulnerabilities occur along the hardware-OS boundary. In order to detect the defects in device-driver communication, PeriScope [174] proposes to fuzz based on the kernel’s page fault handling mechanism. Windows applications are different from Linux’s because they heavily use graphical interfaces, and Windows lacks approaches to clone a process quickly. WINNIE [91] synthesizes a harness to run applications without graphical interfaces. Moreover, it implements fork() for Windows to clone processes eficiently. BigFuzz [207] transforms data-intensive scalable computing (DISC) applications to a semantically equivalent program, which is independent from the DISC framework. Because the DISC framework introduces long latency, the framework-independent execution significantly improves the execution speed.

Gap 3: The automatic execution of applications is based on the sophisticated understanding of those applications. When designing indicators for automatic records of security flaws, the properties of those flaws have to be investigated first.

6 DIRECTIONS OF FUTURE RESEARCH

Fuzzing has attracted much attention from the research community. It has discovered thousands of real-world bugs and severe vulnerabilities in recent decades. We summarize some directions for future research on fuzzing.

More sensitive fitness. Researchers have made many efforts in improving the eficiency and effectiveness of code coverage, especially the sensitivity of code coverage (§3.5.1). Recently, re-searchers realize that code coverage has its limitation in discovering complex bugs. Therefore, they extend code coverage via introducing information (e.g., dangerous code regions) that is obtained by analyzing bugs. Future works can analyze bugs and detect them based on features of bugs, especially analyzing the bugs that evade from current fuzz testing.

More sophisticated fuzzing theories. Current fuzzing theories partially formulate fuzzing processes (§3). Most of the existing works intend to formulate the seed schedule while much fewer works pay attention to other fuzzing processes. Due to the complication of fuzzing processes, few of the existing works formulate the entire fuzzing process. It is non-trivial to mathematically formulate the entire fuzzing process. However, it is possible to formulate more than one fuzzing process, such as Game Theory considering both seed schedule and byte schedule. A bigger picture is about the theoretical limitation of fuzzing (e.g., the limitation of greybox fuzzing). On the other hand, formulating fuzzing processes with multiple types of fitness is another way to build more sophisticated fuzzing theories. For example, future works may formulate fuzzing processes considering both the arrival of bugs and the state transitions.

Sound evaluation. A few works focus on the soundness of evaluation but no firm conclusion has been made (§3.6). These works only provide suggestions for a sound evaluation, such as time budget or evaluation metrics. More questions remain open to be answered. Should we use synthesized bugs or real-world bugs for the evaluation corpora? Is the statistical test the final answer to differentiate two fuzzing techniques? What is a reasonable time budget to terminate the fuzzing processes? How do we evaluate special target applications, such as hardware, when no comparison fuzzer exists?

Scalable input inference. The eficiency of fuzzing can be significantly improved when the format or data dependency is used during fuzzing (§4.6 and §4.7). Static analysis is widely-used for both format inference and data dependency inference. However, static analysis is application-specific, i.e., the implementations of inference approaches are required to consider the features of different applications. Currently, dynamic analysis focuses on format inference and few works make efforts in data dependency inference. Inference approaches with dynamic analysis are possible to be utilized for multiple applications, i.e., dynamic analysis is more scalable than static analysis. More researches may focus on the inference of data dependency based on dynamic analysis.

Eficient mutation operators. Almost all fuzzers utilize fixed mutators during fuzzing. That is, fuzzers design some mutators in advance based on features of target applications and the mutators are not changed during fuzzing (§4). A few works intend to optimize the mutator schedule but no one focuses on changeable mutators (§3.4). Is it possible to design evolved mutators that are changed during fuzzing to improve performance? Because mutator schedule is tightly interacted with byte schedule, it may be promising to design mutators considering byte schedule. Moreover, mutators for highly-structured inputs may have different attribution to general applications. Thus, mutator schedule for highly-structured inputs may also be worth of being studied.

More types of applications. Fuzzing has achieved great success in detecting bugs in command-line programs. Researchers also make many efforts in fuzzing more types of applications (§5.1). Due to the complexity of different applications, fuzzing has its limitation in detecting more types of applications in practice. For example, a few works explore the possibility to fuzz cyber-physical systems, but the fuzzing capability is limited [41, 179]. Because the execution speed is critical for fuzzing, a potential direction for the hard-fuzzing applications is to improve their execution speed.

More types of bugs. Fuzzing has already successfully detected bugs such as memory-violation bugs, concurrency bugs, or algorithmic complexity bugs (§5.2). However, it has dificulties in detecting many other types of bugs, such as privilege escalation or logical bugs. The challenge is to design proper indicators for these bugs so that they are automatically recorded during fuzzing. Because such indicators reflect the features of corresponding bugs, the design of the indicators requires researchers to have a good understanding of both fuzzing and target bugs. For instance, programs will run without exceptions even when they trigger logical bugs. In order to design automatic indicators for logical bugs, it requires a profound understanding of the functional requirements that are used to develop the code.

REFERENCES

[1] Yousra Aafer, Wei You, Yi Sun, Yu Shi, Xiangyu Zhang, and Heng Yin. 2021. Android SmartTVs Vulnerability Discovery via Log-Guided Fuzzing. In 30th USENIX Security Symposium. 1–18.

[2] Humberto Abdelnur, Radu State, Obes Jorge Lucangeli, and Olivier Festor. 2010. Spectral Fuzzing: Evaluation & Feedback. Technical Report. https://hal.inria.fr/inria-00452015

[3] Humberto J Abdelnur, Radu State, and Olivier Festor. 2007. KiF: a stateful SIP fuzzer. In Proceedings of the 1st international Conference on Principles, Systems and Applications of IP Telecommunications. 47–56.

[4] Rahul Agarwal, Liqiang Wang, and Scott D Stoller. 2005. Detecting potential deadlocks with static analysis and run-time monitoring. In Haifa Verification Conference. Springer, 191–207.

[5] Cornelius Aschermann, Patrick Jauernig, Tommaso Frassetto, Ahmad-Reza Sadeghi, Thorsten Holz, and Daniel Teuchert. 2019. NAUTILUS: Fishing for Deep Bugs with Grammars. In The Network and Distributed System Security Symposium (NDSS). San Diego, CA, 1–15.

[6] C. Aschermann, S. Schumilo, A. Abbasi, and T. Holz. 2020. IJON: Exploring Deep State Spaces via Fuzzing. In IEEE Symposium on Security and Privacy (S&P). IEEE Computer Society, Los Alamitos, CA, USA, 874–889.

[7] Cornelius Aschermann, Sergej Schumilo, Tim Blazytko, Robert Gawlik, and Thorsten Holz. 2019. REDQUEEN: Fuzzing with Input-to-State Correspondence. In The Network and Distributed System Security Symposium (NDSS). San Diego, CA, 1–15.

[8] Vaggelis Atlidakis, Patrice Godefroid, and Marina Polishchuk. 2019. RESTler: Stateful REST API Fuzzing. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 748–758.

[9] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47, 2 (2002), 235–256.

[10] Domagoj Babić, Stefan Bucur, Yaohui Chen, Franjo Ivančić, Tim King, Markus Kusano, Caroline Lemieux, László Szekeres, and Wei Wang. 2019. Fudge: fuzz driver generation at scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 975–985.

[11] Greg Banks, Marco Cova, Viktoria Felmetsger, Kevin Almeroth, Richard Kemmerer, and Giovanni Vigna. 2006. SNOOZE: Toward a Stateful Network Protocol Fuzzer. In International conference on information security. Springer, 343–358.

[12] Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 95–110.

[13] Petra Berenbrink and Thomas Sauerwald. 2009. The weighted coupon collector’s problem and applications. In International Computing and Combinatorics Conference. Springer, 449–458.

[14] Donald A Berry and Bert Fristedt. 1985. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability). London: Chapman and Hall 5, 7 (1985), 71–87.

[15] William Blair, Andrea Mambretti, Sajjad Arshad, Michael Weissbacher, William Robertson, Engin Kirda, and Manuel Egele. 2020. HotFuzz: Discovering Algorithmic Denial-of-Service Vulnerabilities Through Guided Micro-Fuzzing. In The Network and Distributed System Security Symposium (NDSS). 1–19.

[16] Tim Blazytko, Matt Bishop, Cornelius Aschermann, Justin Cappos, Moritz Schlögel, Nadia Korshun, Ali Abbasi, Marco Schweighauser, Sebastian Schinzel, Sergej Schumilo, et al. 2019. GRIMOIRE: Synthesizing Structure while Fuzzing. In 28th USENIX Security Symposium. 1985–2002.

[17] Marcel Böhme. 2018. STADS: Software testing as species discovery. ACM Transactions on Software Engineering and Methodology (TOSEM) 27, 2 (2018), 1–52.

[18] Marcel Böhme, Cristian Cadar, and Abhik Roychoudhury. 2020. Fuzzing: Challenges and Reflections. IEEE Software 38, 3 (2020), 79–86.

[19] Marcel Böhme and Brandon Falk. 2020. Fuzzing: On the exponential cost of vulnerability discovery. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 713–724.

[20] Marcel Böhme, Danushka Liyanage, and Valentin Wüstholz. 2021. Estimating Residual Risk in Greybox Fuzzing. In ACM European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 230–241.

[21] Marcel Böhme, Valentin JM Manès, and Sang Kil Cha. 2020. Boosting fuzzer eficiency: An information theoretic perspective. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 678–689.

[22] Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed greybox fuzzing. In The ACM Conference on Computer and Communications Security (CCS). ACM, 2329–2344.

[23] Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2016. Coverage-based greybox fuzzing as markov chain. In The ACM Conference on Computer and Communications Security (CCS). ACM, 1032–1043.

[24] Luca Borzacchiello, Emilio Coppa, and Camil Demetrescu. 2021. Fuzzing Symbolic Expressions. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 711–722.

[25] Tegan Brennan, Seemanta Saha, and Tevfik Bultan. 2020. JVM fuzzing for JIT-induced side-channel detection. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1011–1023.

[26] Chad Brubaker, Suman Jana, Baishakhi Ray, Sarfraz Khurshid, and Vitaly Shmatikov. 2014. Using frankencerts for automated adversarial testing of certificate validation in SSL/TLS implementations. In IEEE Symposium on Security and Privacy (S&P). IEEE, 114–129.

[27] Yan Cai and WK Chan. 2012. MagicFuzzer: Scalable deadlock detection for large-scale applications. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 606–616.

[28] Yan Cai, Biyun Zhu, Ruijie Meng, Hao Yun, Liang He, Purui Su, and Bin Liang. 2019. Detecting concurrency memory corruption vulnerabilities. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 706–717.

[29] Sang Kil Cha, Maverick Woo, and David Brumley. 2015. Program-adaptive mutational fuzzing. In IEEE Symposium on Security and Privacy (S&P). IEEE, 725–741.

[30] Oliver Chang, Jonathan Metzman, Max Moroz, Martin Barbella, and Abhishek Arya. 2016. OSS-Fuzz: Continuous Fuzzing for Open Source Software. https://github.com/google/oss-fuzz [Online; accessed 19-January-2021].

[31] Anne Chao and Chun-Huo Chiu. 2016. Species richness: estimation and comparison. Wiley StatsRef: Statistics Reference Online (2016), 26.

[32] Anne Chao and Robert K Colwell. 2017. Thirty years of progeny from Chao’s inequality: Estimating and comparing richness with incidence data and incomplete sampling. SORT-Statistics and Operations Research Transactions (2017), 3–54.

[33] Anne Chao and Lou Jost. 2012. Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology 93, 12 (2012), 2533–2547.

[34] Daming D Chen, Maverick Woo, David Brumley, and Manuel Egele. 2016. Towards Automated Dynamic Analysis for Linux-based Embedded Firmware.. In The Network and Distributed System Security Symposium (NDSS), Vol. 1. 1–16.

[35] Hongxu Chen, Shengjian Guo, Yinxing Xue, Yulei Sui, Cen Zhang, Yuekang Li, Haijun Wang, and Yang Liu. 2020. MUZZ: Thread-aware Grey-box Fuzzing for Effective Bug Hunting in Multithreaded Programs. In 29th USENIX Security Symposium. USENIX Association, 2325–2342.

[36] Jiongyi Chen, Wenrui Diao, Qingchuan Zhao, Chaoshun Zuo, Zhiqiang Lin, XiaoFeng Wang, Wing Cheong Lau, Menghan Sun, Ronghai Yang, and Kehuan Zhang. 2018. IoTFUZZER: Discovering Memory Corruptions in IoT Through App-based Fuzzing. In The Network and Distributed System Security Symposium (NDSS). 1–15.

[37] Peng Chen and Hao Chen. 2018. Angora: Eficient Fuzzing by Principled Search. In IEEE Symposium on Security and Privacy (S&P). 711–725.

[38] Peng Chen, Jianzhong Liu, and Hao Chen. 2019. Matryoshka: fuzzing deeply nested branches. In The ACM Conference on Computer and Communications Security (CCS). 499–513.

[39] Yuanliang Chen, Yu Jiang, Fuchen Ma, Jie Liang, Mingzhe Wang, Chijin Zhou, Xun Jiao, and Zhuo Su. 2019. EnFuzz: Ensemble Fuzzing with Seed Synchronization among Diverse Fuzzers. In 28th USENIX Security Symposium. USENIX Association, Santa Clara, CA, 1967–1983.

[40] Yaohui Chen, Peng Li, Jun Xu, Shengjian Guo, Rundong Zhou, Yulong Zhang, Long Lu, et al. 2020. SAVIOR: Towards Bug-Driven Hybrid Testing. In IEEE Symposium on Security and Privacy (S&P). IEEE Computer Society, Los Alamitos, CA, USA, 1580–1596.

[41] Yuqi Chen, Christopher M Poskitt, Jun Sun, Sridhar Adepu, and Fan Zhang. 2019. Learning-guided network fuzzing for testing cyber-physical system defences. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 962–973.

[42] Yuting Chen, Ting Su, Chengnian Sun, Zhendong Su, and Jianjun Zhao. 2016. Coverage-directed differential testing of JVM implementations. In proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 85–99.

[43] Yongheng Chen, Rui Zhong, Hong Hu, Hangfan Zhang, Yupeng Yang, Dinghao Wu, and Wenke Lee. 2021. One Engine to Fuzz’em All: Generic Language Processor Testing with Semantic Validation. In IEEE Symposium on Security and Privacy (S&P). 1–17.

[44] Wen Cheng, Wang Haijun, Li Yuekang, Qin Shengchao, Liu Yang, Xu Zhiwu, Chen Hongxu, Xie Xiaofei, Pu Geguang, and Liu Ting. 2020. MemLock: Memory Usage Guided Fuzzing. In ICSE’2020. 765–777.

[45] Siddhartha Chib and Edward Greenberg. 1995. Understanding the metropolis-hastings algorithm. The american statistician 49, 4 (1995), 327–335.

[46] Mingi Cho, Seoyoung Kim, and Taekyoung Kwon. 2019. Intriguer: Field-Level Constraint Solving for Hybrid Fuzzing. In The ACM Conference on Computer and Communications Security (CCS). 515–530.

[47] Jaeseung Choi, Joonun Jang, Choongwoo Han, and Sang Kil Cha. 2019. Grey-box concolic testing on binary code. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 736–747.

[48] Maria Christakis, Peter Müller, and Valentin Wüstholz. 2016. Guiding dynamic symbolic execution toward unverified program executions. In Proceedings of the 38th International Conference on Software Engineering. 144–155.

[49] Alessandro Cimatti, Alberto Griggio, Bastiaan Joost Schaafsma, and Roberto Sebastiani. 2013. The mathsat5 smt solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 93–107.

[50] Koen Claessen and John Hughes. 2000. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the fifth ACM SIGPLAN international conference on Functional programming. 268–279.

[51] James Clause, Wanchun Li, and Alessandro Orso. 2007. Dytan: a generic dynamic taint analysis framework. In Proceedings of the 2007 international symposium on Software testing and analysis. 196–206.

[52] Robert K Colwell, Anne Chao, Nicholas J Gotelli, Shang-Yi Lin, Chang Xuan Mao, Robin L Chazdon, and John T Longino. 2012. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of plant ecology 5, 1 (2012), 3–21.

[53] Jake Corina, Aravind Machiry, Christopher Salls, Yan Shoshitaishvili, Shuang Hao, Christopher Kruegel, and Gio-vanni Vigna. 2017. Difuze: interface aware fuzzing for kernel drivers. In The ACM Conference on Computer and Communications Security (CCS). ACM, 2123–2138.

[54] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An eficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.

[55] Joeri De Ruiter and Erik Poll. 2015. Protocol State Fuzzing of TLS Implementations. In 24th USENIX Security Symposium. 193–206.

[56] Kyle Dewey, Jared Roesch, and Ben Hardekopf. 2014. Language fuzzing using constraint logic programming. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. 725–730.

[57] Kyle Dewey, Jared Roesch, and Ben Hardekopf. 2015. Fuzzing the Rust typechecker using CLP. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 482–493.

[58] Sushant Dinesh, Nathan Burow, Dongyan Xu, and Mathias Payer. 2020. Retrowrite: Statically instrumenting cots binaries for fuzzing and sanitization. In 2020 IEEE Symposium on Security and Privacy (S&P). IEEE, 1497–1511.

[59] Sung Ta Dinh, Haehyun Cho, Kyle Martin, Adam Oest, Kyle Zeng, Alexandros Kapravelos, Gail-Joon Ahn, Tiffany Bao, Ruoyu Wang, Adam Doupé, et al. 2021. Favocado: Fuzzing the Binding Code of JavaScript Engines Using Semantically Correct Test Cases. In The Network and Distributed System Security Symposium (NDSS). 1–15.

[60] Marco Dorigo, Mauro Birattari, and Thomas Stutzle. 2006. Ant colony optimization. IEEE computational intelligence magazine 1, 4 (2006), 28–39.

[61] Adam Doupé, Ludovico Cavedon, Christopher Kruegel, and Giovanni Vigna. 2012. Enemy of the state: A state-aware black-box web vulnerability scanner. In USENIX Security Symposium. 523–538.

[62] Shawn Embleton, Sherri Sparks, and Ryan Cunningham. 2006. Sidewinder: An Evolutionary Guidance System for Malicious Input Crafting. Black Hat USA (2006).

[63] Xiaotao Feng, Ruoxi Sun, Xiaogang Zhu, Minhui Xue, Sheng Wen, Dongxi Liu, Surya Nepal, and Yang Xiang. 2021. Snipuzz: Black-box Fuzzing of IoT Firmware via Message Snippet Inference. The ACM Conference on Computer and Communications Security (CCS), 337–350.

[64] Paul Fiterau-Brostean, Bengt Jonsson, Robert Merget, Joeri de Ruiter, Konstantinos Sagonas, and Juraj Somorovsky. 2020. Analysis of DTLS Implementations Using Protocol State Fuzzing. In 29th USENIX Security Symposium. USENIX Association, Boston, MA, 2523–2540.

[65] Shuitao Gan, Chao Zhang, Peng Chen, Bodong Zhao, Xiaojun Qin, Dong Wu, and Zuoning Chen. 2020. GREYONE: Data Flow Sensitive Fuzzing. In USENIX Security Symposium. 2577–2594.

[66] Shuitao Gan, Chao Zhang, Xiaojun Qin, Xuwen Tu, Kang Li, Zhongyu Pei, and Zuoning Chen. 2018. CollAFL: Path sensitive fuzzing. In IEEE Symposium on Security and Privacy (S&P). IEEE, 679–696.

[67] Vijay Ganesh, Tim Leek, and Martin Rinard. 2009. Taint-based directed whitebox fuzzing. In Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 474–484.

[68] Xiang Gao, Ripon K Saha, Mukul R Prasad, and Abhik Roychoudhury. 2020. Fuzz testing based data augmentation to improve robustness of deep neural networks. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). IEEE, 1147–1158.

[69] Hugo Gascon, Christian Wressnegger, Fabian Yamaguchi, Daniel Arp, and Konrad Rieck. 2015. Pulsar: Stateful black-box fuzzing of proprietary network protocols. In International Conference on Security and Privacy in Communication Systems. Springer, 330–347.

[70] Patrice Godefroid, Bo-Yuan Huang, and Marina Polishchuk. 2020. Intelligent REST API data fuzzing. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 725–736.

[71] Patrice Godefroid, Adam Kiezun, and Michael Y Levin. 2008. Grammar-based whitebox fuzzing. In PLDI ’08 Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 206–215.

[72] Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation. 213–223.

[73] Patrice Godefroid, Michael Y Levin, David A Molnar, et al. 2008. Automated Whitebox Fuzz Testing.. In The Network and Distributed System Security Symposium (NDSS), Vol. 8. 151–166.

[74] Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&fuzz: Machine learning for input fuzzing. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 50–59.

[75] Gustavo Grieco, Martín Ceresa, and Pablo Buiras. 2016. QuickFuzz: an automatic random fuzzer for common file formats. In Proceedings of the 9th International Symposium on Haskell. 13–20.

[76] Istvan Haller, Asia Slowinska, Matthias Neugschwandtner, and Herbert Bos. 2013. Dowsing for Overflows: A Guided Fuzzer to Find Buffer Boundary Violations. In USENIX Security Symposium. 49–64.

[77] HyungSeok Han and Sang Kil Cha. 2017. IMF: Inferred model-based fuzzer. In The ACM Conference on Computer and Communications Security (CCS). ACM, 2345–2358.

[78] HyungSeok Han, DongHyeon Oh, and Sang Kil Cha. 2019. CodeAlchemist: Semantics-Aware Code Generation to Find Vulnerabilities in JavaScript Engines. In The Network and Distributed System Security Symposium (NDSS). 1–15.

[79] Jingxuan He, Mislav Balunović, Nodar Ambroladze, Petar Tsankov, and Martin Vechev. 2019. Learning to Fuzz from Symbolic Execution with Application to Smart Contracts. In The ACM Conference on Computer and Communications Security (CCS). 531–548.

[80] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code fragments. In USENIX Security Symposium. 445–458.

[81] H. Huang, P. Yao, R. Wu, Q. Shi, and C. Zhang. 2020. Pangolin: Incremental Hybrid Fuzzing with Polyhedral Path Abstraction. In IEEE Symposium on Security and Privacy (S&P). IEEE Computer Society, Los Alamitos, CA, USA, 1144–1158.

[82] Kim Hyungsub, Ozmen Muslum Ozgur, Bianchi Antonio, Celik Z. Berkay, and Xu Dongyan. 2021. PGFUZZ: Policy-Guided Fuzzing for Robotic Vehicles. In The Network and Distributed System Security Symposium (NDSS). 1–18.

[83] KyriakosKIspoglou,DanielAustin,VishwathMohan,andMathiasPayer.2020. FuzzGen:AutomaticFuzzerGeneration. In USENIX Security Symposium. 2271–2287.

[84] Dae R Jeong, Kyungtae Kim, Basavesh Shivakumar, Byoungyoung Lee, and Insik Shin. 2019. Razzer: Finding Kernel Race Bugs through Fuzzing. In IEEE Symposium on Security and Privacy (S&P). IEEE S&P, 754–768.

[85] jfoote. 2020. The exploitable GDB plugin. https://github.com/jfoote/exploitable Accessed: 2020-02-07.

[86] Bo Jiang, Ye Liu, and WK Chan. 2018. Contractfuzzer: Fuzzing smart contracts for vulnerability detection. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 259–269.

[87] Zu-Ming Jiang, Jia-Ju Bai, Kangjie Lu, and Shi-Min Hu. 2020. Fuzzing Error Handling Code using Context-Sensitive Software Fault Injection. In 29th USENIX Security Symposium). USENIX Association, Boston, MA, 2595–2612.

[88] Wang Jinghan, Song Chengyu, and Heng Yin. 2021. Reinforcement Learning-based Hierarchical Seed Scheduling for Greybox Fuzzing. In The Network and Distributed System Security Symposium (NDSS). 1–17.

[89] William Johansson, Martin Svensson, Ulf E Larson, Magnus Almgren, and Vincenzo Gulisano. 2014. T-Fuzz: Model-based fuzzing for robustness testing of telecommunication protocols. In 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation. IEEE, 323–332.

[90] Pallavi Joshi, Chang-Seo Park, Koushik Sen, and Mayur Naik. 2009. A randomized dynamic program analysis technique for detecting real deadlocks. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation. 110–120.

[91] Jinho Jung, Stephen Tong, Hong Hu, Jungwon Lim, Yonghwi Jin, and Taesoo Kim. 2021. WINNIE: Fuzzing Windows Applications with Harness Synthesis and Fast Cloning. In The Network and Distributed System Security Symposium (NDSS). 1–17.

[92] Ulf Kargén and Nahid Shahmehri. 2015. Turning programs against each other: high coverage fuzz-testing using binary-code mutation and dynamic slicing. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 782–792.

[93] John G Kemeny and J Laurie Snell. 1976. Markov chains. Vol. 6. Springer-Verlag, New York.

[94] James Kennedy and Russell Eberhart. 1995. Particle swarm optimization. In Proceedings of ICNN’95-international conference on neural networks, Vol. 4. IEEE, 1942–1948.

[95] Kyungtae Kim, Dae R Jeong, Chung Hwan Kim, Yeongjin Jang, Insik Shin, and Byoungyoung Lee. 2020. HFL: Hybrid Fuzzing on the Linux Kernel. In The Network and Distributed System Security Symposium (NDSS). 1–17.

[96] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2123–2138.

[97] Zhifeng Lai, Shing-Chi Cheung, and Wing Kwong Chan. 2010. Detecting atomic-set serializability violations in multithreaded programs through active randomized testing. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1. 235–244.

[98] Gwangmu Lee, Woochul Shim, and Byoungyoung Lee. 2021. Constraint-guided Directed Greybox Fuzzing. In 30th USENIX Security Symposium. 1–18.

[99] Suyoung Lee, HyungSeok Han, Sang Kil Cha, and Sooel Son. 2020. Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer. In 29th USENIX Security Symposium. USENIX Association, Boston, MA, 1–18.

[100] Seungsoo Lee, Changhoon Yoon, Chanhee Lee, Seungwon Shin, Vinod Yegneswaran, and Phillip A Porras. 2017. DELTA: A Security Assessment Framework for Software-Defined Networks.. In The Network and Distributed System Security Symposium (NDSS). 1–15.

[101] Caroline Lemieux and Koushik Sen. 2018. Fairfuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 475–485.

[102] Jun Li, Bodong Zhao, and Chao Zhang. 2018. Fuzzing: a survey. Cybersecurity 1, 1 (2018), 1–13.

[103] Yuekang Li, Bihuan Chen, Mahinthan Chandramohan, Shang-Wei Lin, Yang Liu, and Alwen Tiu. 2017. Steelix: program-state based binary fuzzing. In ESEC/ FSE’ 17. ACM, PADERBORN, GERMANY, 627–637.

[104] Yuwei Li, Shouling Ji, Yuan Chen, Sizhuang Liang, Wei-Han Lee, Yueyao Chen, Chenyang Lyu, Chunming Wu, Raheem Beyah, Peng Cheng, et al. 2021. UniFuzz: A holistic and pragmatic metrics-driven platform for evaluating fuzzers. In 30th USENIX Security Symposium. 1–18.

[105] Yuekang Li, Yinxing Xue, Hongxu Chen, Xiuheng Wu, Cen Zhang, Xiaofei Xie, Haijun Wang, and Yang Liu. 2019. Cerebro: context-aware adaptive fuzzing for effective vulnerability detection. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 533–544.

[106] Chieh-Jan Mike Liang, Nicholas D Lane, Niels Brouwers, Li Zhang, Börje F Karlsson, Hao Liu, Yan Liu, Jun Tang, Xiang Shan, Ranveer Chandra, et al. 2014. Caiipa: Automated large-scale mobile app testing through contextual fuzzing. In Proceedings of the 20th annual international conference on Mobile computing and networking. 519–530.

[107] Hongliang Liang, Xiaoxiao Pei, Xiaodong Jia, Wuwei Shen, and Jian Zhang. 2018. Fuzzing: State of the art. IEEE Transactions on Reliability 67, 3 (2018), 1199–1218.

[108] Daniel Liew, Cristian Cadar, Alastair F Donaldson, and J Ryan Stinnett. 2019. Just fuzz it: solving floating-point constraints using coverage-guided fuzzing. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 521–532.

[109] Guanjun Lin, Sheng Wen, Qing-Long Han, Jun Zhang, and Yang Xiang. 2020. Software vulnerability detection using deep neural networks: a survey. Proc. IEEE 108, 10 (2020), 1825–1848.

[110] Baozheng Liu, Chao Zhang, Guang Gong, Yishun Zeng, Haifeng Ruan, and Jianwei Zhuge. 2020. FANS: Fuzzing Android Native System Services via Automated Interface Analysis. In 29th USENIX Security Symposium. USENIX Association, Boston, MA, 307–323.

[111] Peng Liu, Xiangyu Zhang, Marco Pistoia, Yunhui Zheng, Manoel Marques, and Lingfei Zeng. 2017. Automatic text input generation for mobile testing. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 643–653.

[112] Xiao Liu, Xiaoting Li, Rupesh Prajapati, and Dinghao Wu. 2019. Deepfuzz: Automatic generation of syntax valid c programs for fuzz testing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1044–1051.

[113] LLVM. 2021. The LLVM Compiler Infrastructure. https://llvm.org/ Accessed: March 2021.

[114] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. 2008. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. In Proceedings of the 13th international conference on Architectural support for programming languages and operating systems. 329–339.

[115] Weisi Luo, Dong Chai, Xiaoyue Run, Jiang Wang, Chunrong Fang, and Zhenyu Chen. 2021. Graph-based Fuzz Testing for Deep Learning Inference Engines. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 288–299.

[116] Chenyang Lyu, Shouling Ji, Chao Zhang, Yuwei Li, Wei-Han Lee, and Yu Song. 2019. MOPT: Optimized Mutation Scheduling for Fuzzers. In 28th USENIX Security Symposium. USENIX Association, Santa Clara, CA, 1949–1966.

[117] Lei Ma, Cyrille Artho, Cheng Zhang, Hiroyuki Sato, Johannes Gmeiner, and Rudolf Ramler. 2015. GRT: Program-analysis-guided random testing (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 212–223.

[118] Valentin JM Manès, Soomin Kim, and Sang Kil Cha. 2020. Ankou: guiding grey-box fuzzing towards combinatorial difference. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1024–1036.

[119] Valentin Jean Marie Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. 2019. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering (2019), 1–21.

[120] Muhammad Numair Mansur, Maria Christakis, Valentin Wüstholz, and Fuyuan Zhang. 2020. Detecting critical bugs in SMT solvers using blackbox mutational fuzzing. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 701–712.

[121] Ke Mao, Mark Harman, and Yue Jia. 2016. Sapienz: Multi-objective automated testing for android applications. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 94–105.

[122] Paul Dan Marinescu and Cristian Cadar. 2013. KATCH: High-coverage testing of software patches. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. 235–245.

[123] Björn Mathis, Vitalii Avdiienko, Ezekiel O Soremekun, Marcel Böhme, and Andreas Zeller. 2017. Detecting information flow by mutating input data. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 263–273.

[124] Xiaozhu Meng and Barton P Miller. 2016. Binary code is not easy. In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, 24–35.

[125] Nicholas Metropolis and Stanislaw Ulam. 1949. The monte carlo method. Journal of the American statistical association 44, 247 (1949), 335–341.

[126] Jonathan Metzman, Abhishek Arya, and Laszlo Szekeres. 2020. FuzzBench: Fuzzer Benchmarking as a Service. https://security.googleblog.com/2020/03/fuzzbench-fuzzer-benchmarking-as-service.html

[127] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association. 1045–1048.

[128] Barton Miller and Jeff Hollingsworth. 2019. Dyninst: An API for program binary analysis and instrumentation. https://dyninst.org/dyninst Accessed: 2019-06.

[129] Barton P Miller, Louis Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Commun. ACM 33, 12 (1990), 32–44.

[130] Charlie Miller. 2008. Fuzz By Number – More Data About Fuzzing Than You Ever Wanted To Know. In CanSecWest. [131] MITRE. 2020. 2020 CWE Top 25 Most Dangerous Software Weaknesses. https://cwe.mitre.org/top25/archive/2020/

2020_cwe_top25.html

[132] David Molnar, Xue Cong Li, and David A Wagner. 2009. Dynamic Test Generation to Find Integer Bugs in x86 Binary Linux Programs.. In USENIX Security Symposium, Vol. 9. 67–82.

[133] Stefan Nagy and Matthew Hicks. 2019. Full-speed Fuzzing: Reducing Fuzzing Overhead through Coverage-guided Tracing. In IEEE Symposium on Security and Privacy (S&P). 787–802.

[134] Stefan Nagy, Anh Nguyen-Tuong, Jason D Hiser, Jack W Davidson, and Matthew Hicks. 2021. Breaking Through Binaries: Compiler-quality Instrumentation for Better Binary-only Fuzzing. In 30th USENIX Security Symposium. 1–19.

[135] James Newsome and Dawn Xiaodong Song. 2005. Dynamic Taint Analysis for Automatic Detection, Analysis, and SignatureGeneration of Exploits on Commodity Software. In The Network and Distributed System Security Symposium (NDSS), Vol. 5. Citeseer, 3–4.

[136] Hoang Lam Nguyen, Nebras Nassar, Timo Kehrer, and Lars Grunske. 2020. MoFuzz: A Fuzzer Suite for Testing Model-Driven Software Engineering Tools. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1103–1115.

[137] Tai D Nguyen, Long H Pham, Jun Sun, Yun Lin, and Quang Tran Minh. 2020. sFuzz: An eficient adaptive fuzzer for solidity smart contracts. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 778–788.

[138] Shirin Nilizadeh, Yannic Noller, and Corina S Pasareanu. 2019. DifFuzz: differential fuzzing for side-channel analysis. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 176–187.

[139] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. 2020. SpecFuzz: Bringing Spectre-type vulnerabilities to the surface. In 29th USENIX Security Symposium. USENIX Association, Boston, MA, 1481–1498.

[140] Sebastian Österlund, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2020. ParmeSan: Sanitizer-guided Greybox Fuzzing. In 29th USENIX Security Symposium. USENIX Association, Boston, MA, 2289–2306.

[141] Carlos Pacheco, Shuvendu K Lahiri, Michael D Ernst, and Thomas Ball. 2007. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE’07). IEEE, 75–84.

[142] Shankara Pailoor, Andrew Aday, and Suman Jana. 2018. Moonshine: Optimizing OS fuzzer seed selection with trace distillation. In 27th USENIX Security Symposium. 729–743.

[143] Jianfeng Pan, Guanglu Yan, and Xiaocao Fan. 2017. Digtool: A virtualization-based framework for detecting kernel vulnerabilities. In 26th USENIX Security Symposium. 149–165.

[144] Chang-Seo Park and Koushik Sen. 2008. Randomized active atomicity violation detection in concurrent programs. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. 135–145.

[145] S. Park, W. Xu, I. Yun, D. Jang, and T. Kim. 2020. Fuzzing JavaScript Engines with Aspect-Preserving Mutation. In IEEE Symposium on Security and Privacy (S&P). IEEE Computer Society, Los Alamitos, CA, USA, 1211–1225.

[146] Terence Parr. [n.d.]. ANTLR: ANother Tool for Language Recognition. https://www.antlr.org/ Accessed: 2021-01. [147] Peachtech. 2021. PEACH: THE PEACH FUZZER PLATFORM. https://www.peach.tech/products/peach-fuzzer/

Accessed: 2021-01.

[148] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. Deepxplore: Automated whitebox testing of deep learning systems. In proceedings of the 26th Symposium on Operating Systems Principles. 1–18.

[149] Hui Peng and Mathias Payer. 2020. USBFuzz: A Framework for Fuzzing USB Drivers by Device Emulation. In 29th USENIX Security Symposium. USENIX Association, 2559–2575.

[150] HuiPeng,YanShoshitaishvili,andMathiasPayer.2018. T-Fuzz:fuzzingbyprogramtransformation.InIEEESymposium on Security and Privacy (S&P). IEEE, 697–710.

[151] Theofilos Petsios, Adrian Tang, Salvatore Stolfo, Angelos D Keromytis, and Suman Jana. 2017. NEZHA: Eficient domain-independent differential testing. In IEEE Symposium on Security and Privacy (S&P). IEEE, 615–632.

[152] Theofilos Petsios, Jason Zhao, Angelos D Keromytis, and Suman Jana. 2017. Slowfuzz: Automated domain-independent detection of algorithmic complexity vulnerabilities. In The ACM Conference on Computer and Communications Security (CCS). ACM, 2155–2168.

[153] Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury. 2016. Model-based whitebox fuzzing for program binaries. In Automated Software Engineering (ASE), 2016 31st IEEE/ACM International Conference on. IEEE, 543–553.

[154] Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury. 2020. AFLNET: A Greybox Fuzzer for Network Protocols. In IEEE International Conference on Software Testing, Verification and Validation (ICST) 2020. 1–6.

[155] Van-Thuan Pham, Wei Boon Ng, Konstantin Rubinov, and Abhik Roychoudhury. 2015. Hercules: Reproducing crashes in real-world application binaries. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 891–901.

[156] Rasmus V Rasmussen and Michael A Trick. 2008. Round robin scheduling–a survey. European Journal of Operational Research 188, 3 (2008), 617–636.

[157] Sanjay Rawat, Vivek Jain, Ashish Kumar, Lucian Cojocar, Cristiano Giuffrida, and Herbert Bos. 2017. VUzzer: Application-aware evolutionary fuzzing. In The Network and Distributed System Security Symposium (NDSS). 1–15.

[158] Alexandre Rebert, Sang Kil Cha, Thanassis Avgerinos, Jonathan Foote, David Warren, Gustavo Grieco, and David Brumley. 2014. Optimizing seed selection for fuzzing. In 23rd USENIX Security Symposium. 861–875.

[159] Nilo Redini, Andrea Continella, Dipanjan Das, Giulio De Pasquale, Noah Spahn, Aravind Machiry, Antonio Bianchi, Christopher Kruegel, and Giovanni Vigna. 2021. DIANE: Identifying Fuzzing Triggers in Apps to Generate Under-constrained Inputs for IoT Devices. In IEEE Symposium on Security and Privacy (S&P). 1–17.

[160] Prateek Saxena, Steve Hanna, Pongsin Poosankam, and Dawn Song. 2010. FLAX: Systematic Discovery of Client-side Validation Vulnerabilities in Rich Web Applications.. In The Network and Distributed System Security Symposium (NDSS). 1–17.

[161] Fred B Schneider. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys (CSUR) 22, 4 (1990), 299–319.

[162] Sergej Schumilo, Cornelius Aschermann, Ali Abbasi, Simon Wörner, and Thorsten Holz. 2020. HYPER-CUBE: High-Dimensional Hypervisor Fuzzing. In The Network and Distributed System Security Symposium (NDSS). 1–16.

[163] Sergej Schumilo, Cornelius Aschermann, Ali Abbasi, Simon Wörner, and Thorsten Holz. 2021. NYX: Greybox Hypervisor Fuzzing using Fast Snapshots and Afine Types. In 30th USENIX Security Symposium. 1–18.

[164] SergejSchumilo,CorneliusAschermann,RobertGawlik,SebastianSchinzel,andThorstenHolz.2017. kAFL:Hardware-assisted feedback fuzzing for OS kernels. In USENIX Security Symposium. 167–182.

[165] Koushik Sen. 2007. Concolic testing. In Proceedings of the 22nd IEEE/ACM international conference on Automated software engineering. 571–572.

[166] Koushik Sen. 2007. Effective random testing of concurrent programs. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering. 323–332.

[167] Koushik Sen. 2008. Race directed random testing of concurrent programs. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 11–21.

[168] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitry Vyukov. 2012. AddressSanitizer: A Fast Address Sanity Checker. In USENIX Security Symposium. 309–318.

[169] Hossain Shahriar and Mohammad Zulkernine. 2012. Mitigating program security vulnerabilities: Approaches and challenges. ACM Computing Surveys (CSUR) 44, 3 (2012), 1–46.

[170] Claude E Shannon. 1948. A mathematical theory of communication. The Bell system technical journal 27, 3 (1948), 379–423.

[171] Dongdong She, Rahul Krishna, Lu Yan, Suman Jana, and Baishakhi Ray. 2020. MTFuzz: fuzzing with a multi-task neural network. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference andSymposium on the Foundations of Software Engineering. 737–749.

[172] Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, and Suman Jana. 2019. NEUZZ: Eficient Fuzzing with Neural Program Smoothing. In IEEE Symposium on Security and Privacy (S&P). 803–817.

[173] Juraj Somorovsky. 2016. Systematic fuzzing and testing of TLS libraries. In The ACM Conference on Computer and Communications Security (CCS). 1492–1504.

[174] Dokyung Song, Felicitas Hetzelt, Dipanjan Das, Chad Spensky, Yeoul Na, Stijn Volckaert, Giovanni Vigna, Christopher Kruegel, Jean-Pierre Seifert, and Michael Franz. 2019. Periscope: An effective probing and fuzzing framework for the hardware-os boundary. In The Network and Distributed System Security Symposium (NDSS). 1–15.

[175] Dokyung Song, Julian Lettner, Prabhu Rajasekaran, Yeoul Na, Stijn Volckaert, Per Larsen, and Michael Franz. 2019. SoK: Sanitizing for Security. In IEEE Symposium on Security and Privacy (S&P). 1275–1295.

[176] Suhwan Song, Chengyu Song, Yeongjin Jang, and Byoungyoung Lee. 2020. CrFuzz: fuzzing multi-purpose programs through input validation. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 690–700.

[177] Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2016. Driller: Augmenting Fuzzing Through Selective Symbolic Execution. In The Network and Distributed System Security Symposium (NDSS). 1–16.

[178] Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. 2013. Sok: Eternal war in memory. In IEEE Symposium on Security and Privacy (S&P). 48–62.

[179] Dimitrios Tychalas, Hadjer Benkraouda, and Michail Maniatakos. 2021. ICSFuzz: Manipulating I/Os and Repurposing Binary Code to Enable Instrumented Fuzzing in ICS Control Applications. In 30th USENIX Security Symposium. 1–16.

[180] Spandan Veggalam, Sanjay Rawat, Istvan Haller, and Herbert Bos. 2016. IFuzzer: An evolutionary interpreter fuzzer using genetic programming. In European Symposium on Research in Computer Security. Springer, 581–601.

[181] Vasudev Vikram, Rohan Padhye, and Koushik Sen. 2021. Growing A Test Corpus with Bonsai Fuzzing. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 723–735.

[182] Martin Vuagnoux. 2005. Autodafé: An act of software torture. In Proceedings of the 22th Chaos Communication Congress. Chaos Computer Club, 47–58.

[183] Dmitry Vyukov. 2021. syzkaller. https://github.com/google/syzkaller Accessed: May 2021.

[184] Haijun Wang, Xiaofei Xie, Yi Li, Cheng Wen, Yuekang Li, Yang Liu, Shengchao Qin, Hongxu Chen, and Yulei Sui. 2020. Typestate-guided fuzzer for discovering use-after-free vulnerabilities. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). IEEE, 999–1010.

[185] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2017. Skyfire: Data-driven seed generation for fuzzing. In IEEE Symposium on Security and Privacy (S&P). IEEE, 579–594.

[186] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2019. Superion: Grammar-aware greybox fuzzing. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 724–735.

[187] Tielei Wang, Tao Wei, Guofei Gu, and Wei Zou. 2010. TaintScope: A checksum-aware directed fuzzing tool for automatic software vulnerability detection. In IEEE Symposium on Security and Privacy (S&P). IEEE, 497–512.

[188] Yanhao Wang, Xiangkun Jia, Yuwei Liu, Kyle Zeng, Tiffany Bao, Dinghao Wu, and Purui Su. 2020. Not All Coverage Measurements Are Equal: Fuzzing by Coverage Accounting for Input Prioritization. In The Network and Distributed System Security Symposium (NDSS). 1–17.

[189] Richard Wartell, Yan Zhou, Kevin W Hamlen, Murat Kantarcioglu, and Bhavani Thuraisingham. 2011. Differentiating code from data in x86 binaries. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 522–536.

[190] Jiayi Wei, Jia Chen, Yu Feng, Kostas Ferles, and Isil Dillig. 2018. Singularity: Pattern fuzzing for worst case complexity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 213–223.

[191] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and intelligent laboratory systems 2, 1-3 (1987), 37–52.

[192] Maverick Woo, Sang Kil Cha, Samantha Gottlieb, and David Brumley. 2013. Scheduling black-box mutational fuzzing. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. 511–522.

[193] Valentin Wüstholz and Maria Christakis. 2020. Harvey: A Greybox Fuzzer for Smart Contracts. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 1398–1409.

[194] M. Xu, S. Kashyap, H. Zhao, and T. Kim. 2020. Krace: Data Race Fuzzing for Kernel File Systems. In IEEE Symposium on Security and Privacy (S&P). IEEE Computer Society, Los Alamitos, CA, USA, 1396–1413.

[195] Wen Xu, Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. 2017. Designing new operating primitives to improve fuzzing performance. In The ACM Conference on Computer and Communications Security (CCS). ACM, 2313–2328.

[196] Wen Xu, Hyungon Moon, Sanidhya Kashyap, Po-Ning Tseng, and Taesoo Kim. 2019. Fuzzing File Systems via Two-Dimensional Input Space Exploration. In IEEE Symposium on Security and Privacy (S&P). 818–834.

[197] Wen Xu, Soyeon Park, and Taesoo Kim. 2020. FREEDOM: Engineering a State-of-the-Art DOM Fuzzer. In The ACM Conference on Computer and Communications Security (CCS). 971–986.

[198] Shengbo Yan, Chenlu Wu, Hang Li, Wei Shao, and Chunfu Jia. 2020. PathAFL: Path-Coverage Assisted Fuzzing. In Proceedings of the 15th ACM Asia Conference on Computer and Communications Security. 598–609.

[199] Dingning Yang, Yuqing Zhang, and Qixu Liu. 2012. Blendfuzz: A model-based framework for fuzz testing programs with grammatical inputs. In 2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 1070–1076.

[200] Wei You, Xuwei Liu, Shiqing Ma, David Perry, Xiangyu Zhang, and Bin Liang. 2019. SLF: Fuzzing without Valid Seed Inputs. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 712–723.

[201] WeiYou,XueqiangWang,ShiqingMa,JianjunHuang,XiangyuZhang,XiaoFengWang,andBinLiang.2019. ProFuzzer: On-the-fly Input Type Probing for Better Zero-Day Vulnerability Discovery. In IEEE Symposium on Security and Privacy (S&P). IEEE, 769–786.

[202] Wei You, Peiyuan Zong, Kai Chen, XiaoFeng Wang, Xiaojing Liao, Pan Bian, and Bin Liang. 2017. SemFuzz: Semantics-based Automatic Generation of Proof-of-Concept Exploits. In The ACM Conference on Computer and Communications Security (CCS). ACM, 2139–2154.

[203] Tai Yue, Pengfei Wang, Yong Tang, Enze Wang, Bo Yu, Kai Lu, and Xu Zhou. 2020. EcoFuzz: Adaptive Energy-Saving Greybox Fuzzing as a Variant of the Adversarial Multi-Armed Bandit. In 29th USENIX Security Symposium. USENIX Association, Boston, MA, 2307–2324.

[204] Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, and Taesoo Kim. 2018. QSYM: A Practical Concolic Execution Engine Tailored for Hybrid Fuzzing. In 27th USENIX Security Symposium. 745–761.

[205] Jonas Zaddach, Luca Bruno, Aurelien Francillon, Davide Balzarotti, et al. 2014. AVATAR: A Framework to Support Dynamic Security Analysis of Embedded Systems’ Firmwares.. In The Network and Distributed System Security Symposium (NDSS), Vol. 23. 1–16.

[206] Michal Zalewski. 2021. AFL (american fuzzy lop). https://github.com/google/AFL accessed 21-January-2021.

[207] Qian Zhang, Jiyuan Wang, Muhammad Ali Gulzar, Rohan Padhye, and Miryung Kim. 2020. BigFuzz: Eficient Fuzz Testing for Data Analytics Using Framework Abstraction. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 722–733.

[208] Yangyong Zhang, Lei Xu, Abner Mendoza, Guangliang Yang, Phakpoom Chinprutthiwong, and Guofei Gu. 2019. Life after Speech Recognition: Fuzzing Semantic Misinterpretation for Voice Assistant Applications.. In The Network and Distributed System Security Symposium (NDSS). 1–15.

[209] Zhuo Zhang, Wei You, Guanhong Tao, Yousra Aafer, Xuwei Liu, and Xiangyu Zhang. 2021. STOCHFUZZ: Sound and Cost-effective Fuzzing of Stripped Binaries by Incremental and Stochastic Rewriting. In IEEE Symposium on Security and Privacy (S&P). 1–18.

[210] Lei Zhao, Yue Duan, Heng Yin, and Jifeng Xuan. 2019. Send Hardest Problems My Way: Probabilistic Path Prioritization for Hybrid Fuzzing. In The Network and Distributed System Security Symposium (NDSS). San Diego, CA, 1–15.

[211] Yaowen Zheng, Ali Davanian, Heng Yin, Chengyu Song, Hongsong Zhu, and Limin Sun. 2019. FIRM-AFL: High-Throughput Greybox Fuzzing of IoT Firmware via Augmented Process Emulation. In 28th USENIX Security Symposium. USENIX Association, Santa Clara, CA, 1099–1114.

[212] Rui Zhong, Yongheng Chen, Hong Hu, Hangfan Zhang, Wenke Lee, and Dinghao Wu. 2020. SQUIRREL: Testing Database Management Systems with Language Validity and Coverage Feedback. In The ACM Conference on Computer and Communications Security (CCS). 955–970.

[213] Chijin Zhou, Mingzhe Wang, Jie Liang, Zhe Liu, and Yu Jiang. 2020. Zeror: Speed Up Fuzzing with Coverage-sensitive Tracing and Scheduling. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 858–870.

[214] Xiaogang Zhu and Marcel Böhme. 2021. Regression Greybox Fuzzing. In The ACM Conference on Computer and Communications Security (CCS). 2169–2182.

[215] Xiaogang Zhu, Xiaotao Feng, Tengyun Jiao, Sheng Wen, Yang Xiang, Seyit Camtepe, and Jingling Xue. 2019. A Feature-Oriented Corpus for Understanding, Evaluating and Improving Fuzz Testing. In Asia CCS ’19 Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security. AUCKLAND, NEW ZEALAND, 658–663.

[216] Xiaogang Zhu, Xiaotao Feng, Xiaozhu Meng, Sheng Wen, Seyit Camtepe, Yang Xiang, and Kui Ren. 2020. CSI-Fuzz: Full-speed Edge Tracing Using Coverage Sensitive Instrumentation. IEEE Transactions on Dependable and Secure Computing (2020), 1–12.

[217] Peiyuan Zong, Tao Lv, Dawei Wang, Zizhuang Deng, Ruigang Liang, and Kai Chen. 2020. FuzzGuard: Filtering out Unreachable Inputs in Directed Grey-box Fuzzing through Deep Learning. In USENIX Security Symposium. 2255–2269.

Snipuzz: Black-box Fuzzing of IoT Firmware via Message Snippet Inference

hyperlab@newsletter.paragraph.com (HyperLab) — Fri, 22 Jul 2022 18:50:11 GMT

ABSTRACT

The proliferation of Internet of Things (IoT) devices has made people’s lives more convenient, but it has also raised many security concerns. Due to the difficulty of obtaining and emulating IoT firmware, in the absence of internal execution information, blackbox fuzzing of IoT devices has become a viable option. However, existing black-box fuzzers cannot form effective mutation optimization mechanisms to guide their testing processes, mainly due to the lack of feedback. In addition, because of the prevalent use of various and non-standard communication message formats in IoT devices, it is difficult or even impossible to apply existing grammar-based fuzzing strategies. Therefore, an efficient fuzzing approach with syntax inference is required in the IoT fuzzing domain. To address these critical problems, we propose a novel automatic black-box fuzzing for IoT firmware, termed Snipuzz. Snipuzz runs as a client communicating with the devices and infers message snippets for mutation based on the responses. Each snippet refers to a block of consecutive bytes that reflect the approximate code coverage in fuzzing. This mutation strategy based on message snippets considerably narrows down the search space to change the probing messages. We compared Snipuzz with four state-of-theart IoT fuzzing approaches, i.e., IoTFuzzer, BooFuzz, Doona, and Nemesys. Snipuzz not only inherits the advantages of app-based fuzzing (e.g., IoTFuzzer), but also utilizes communication responses to perform efficient mutation. Furthermore, Snipuzz is lightweight as its execution does not rely on any prerequisite operations, such as reverse engineering of apps. We also evaluated Snipuzz on 20 popular real-world IoT devices. Our results show that Snipuzz could identify 5 zero-day vulnerabilities, and 3 of them could be exposed only by Snipuzz. All the newly discovered vulnerabilities have been confirmed by their vendors.

CCS CONCEPTS • Security and privacy → Software and application security. KEYWORDS Fuzzing, IoT Firmware, Mutation, and Vulnerabilities

ACM Reference Format: Xiaotao Feng, Ruoxi Sun, Xiaogang Zhu, Minhui Xue, Sheng Wen, Dongxi Liu, Surya Nepal, and Yang Xiang. 2021. Snipuzz: Black-box Fuzzing of IoT Firmware via Message Snippet Inference. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS ’21), November 15–19, 2021, Virtual Event, Republic of Korea. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3460120.3484543

1 INTRODUCTION

The Internet of Things (IoT) refers to the billions of physical devices around the world which are now connected to the Internet, all collecting and sharing data. As early as 2017, IoT devices have out numbered the world’s population [39], and by 2020, every person on this planet has four IoT devices on average [23]. While these devices enrich our lives and industries, unfortunately, they also introduce blind spots and security risks in the form of vulnerabilities. We take Mirai [25] as an example. Mirai is one of the most prominent types of IoT botnet malware. In 2016, Mirai took down widely-used websites in a distributed denial of service (DDoS) campaign consisting of thousands of compromised household IoT devices. In the case of Mirai, attackers exploited vulnerabilities to target IoT devices themselves and then weaponized the devices for larger campaigns or spreading malware to the network. In fact, attackers can also use vulnerable devices for lateral movement, allowing them to reach critical targets. For example, in the work-from-home scenarios during COVID-19, Trend Micro has reported that, introducing vulnerable IoT devices to the household will expose employees to malware and attacks that could slip into a company’s network [26]. Considering the ubiquity of IoT devices, we believe that these known security incidents and risky scenarios are nothing but a tip of the icebergIoT vulnerabilities are normally about the implementation flaws within a device’s firmware. To launch new products as soon as possible, developers always tend to use open-source components in firmware development without good update plans [1]. This sacrifices the security of IoT devices and exposes them to vulnerabilities that security teams cannot remedy quickly. Even if vendors plan to fix the vulnerabilities in their products, the over-the-air patching is usually infeasible because IoT devices do not have reliable network connectivity [16]. As a result, half of the IoT devices in the market were reported to have vulnerabilities [28]. It is hence crucial to discover such vulnerabilities and fix them before an attacker does. However, most IoT software security tests heavily rely on the assumption of device firmware availability. In many cases, manufacturers tend not to release their product firmware and that makes various dynamic analysis methods based on code analysis [7, 13, 15, 18, 32, 46] (or emulation [8, 10, 20, 50, 51]) difficult. Among the existing defense techniques, fuzz testing has shown promises to overcome these issues and has been widely used as an efficient approach in finding vulnerabilities. Moreover, the ability of IoT devices to communicate with the outside world offers us a new option, and that is to test device firmware through exchanging network messages. Therefore, an IoT fuzzer could be designed to send random communication messages to the target device in order to detect if it shows any symptoms of malfunctioning. Potential vulnerabilities could be exposed if crashes are triggered during execution or the device is pushed to send back abnormal messages. However, using network communication to fuzz the firmware of IoT devices is very challenging. Since obtaining internal execution information from the device is not possible, most existing network IoT fuzzers [9, 31, 44] work in a black-box manner. This makes optimizing the mutation strategies very difficult. Because the selection of mutated seeds is entirely random, existing black-box IoT fuzzing approaches could become very hard to handle, and sometimes, even become more like brute force crack testing. In addition, IoT devices have strict grammatical specifications for inputs in communication. Most of the messages that are generated by random mutation will break the syntax rules of the input, and will be quickly rejected during syntax validation in the firmware before being executed. A grammar-based mutation strategy [2, 40] can effectively generate messages that meet the input requirements though. This can be done by learning the syntax via documented grammatical specifications or from a labeled training set. However, as shown in Table 1, many non-standard IoT device communication formats are being used in practice. Therefore, preparing enough learning materials for grammar-based mutation strategies is a huge workload, which makes the deployment of grammar-based IoT fuzzing difficult.

Challenges.

In this paper, we focus on detecting vulnerabilities in IoT firmware by sending messages to IoT devices. To design an effective and efficient fuzzing method, several challenges have to be overcome.

• Challenge 1: Lack of a feedback mechanism. Without access to firmware, it is nearly impossible to obtain the internal execution information from IoT device to guide the fuzzing process(as is done in most typical fuzzers). Therefore, we need a lightweight solution to obtain feedback from device, and optimize the generation process.

• Challenge 2: Diverse message formats. Table 1 shows some message formats that are used in IoT communication, including JSON, SOAP, Key-value pairs, string, or even customized formats. In order to be applied to various devices, a solution should be able to infer the format from a raw message.

• Challenge 3: Randomness in responses. The response messages of an IoT device may contain random elements, such as timestamps or tokens. Such randomness results in different responses for the same message, and diminishes the effectiveness of fuzzing because the input generation of Snipuzz relies on responses.

Our approach. In this paper, we propose a novel and automatic black-box IoT fuzzing, named Snipuzz, to detect vulnerabilities in IoT firmware. Different from other existing IoT fuzzing approaches, Snipuzz implements a snippet-based mutation strategy which utilizes feedback from IoT devices to guide the fuzzing. Specifically, Snipuzz uses a novel heuristic algorithm to detect the role of each byte in the message. It will first mutate bytes in a message one by one to generate probe messages, and categorize the corresponding responses collected from device. Adjacent bytes that have the same role in the message form the initial message snippets, which is the basic unit of mutation. Moreover, Snipuzz utilizes a hierarchical clustering strategy to optimize mutation strategies and reduce the misclassification of categories caused by randomness in the response messages and the firmware’s internal mechanism. Therefore, Snipuzz, as a black-box fuzzer, can still effectively test the firmware of IoT devices without the support of grammatical rules and internal execution information of the device. Snipuzz resolves

Challenge 1

by using responses as the guidance to optimize the fuzzing process. Based on the responses, Snipuzz designs a novel heuristic algorithm to initially infer the role of each byte in the message, which resolves

Challenge 2.

Snipuzz utilizes edit distance [42] and agglomerative hierarchical clustering [43] to resolve

Challenge 3.

We summarize our main contributions as follows:

• Message snippet inference mechanism.

The responses from IoT devices are related to code execution path in firmware. Based on responses, we infer the relationship between message snippets and code execution path in firmware. This novel mutation mechanism enables that Snipuzz does not need any syntax rules to infer the hidden grammatical structure of the input through the device responses. Compared with the actual syntax rules that determine the input string format, the result of snippet determination proposed by Snipuzz has a similarity of 87.1%.

• More effective IoT fuzzing.

When testing IoT devices, the number of response categories is positively correlated with the number of code execution paths in the firmware. In the experiment, the number of response categories explored by Snipuzz far exceeded other methods on most devices, no matter how long the analysis duration was (in 10 minutes or 24 hours).

• Implementation and vulnerability findings.

We implemented the prototype of Snipuzz.2 We used it to test 20 real-world consumer-grade IoT devices while comparing with the state of-the-art fuzzing tools, i.e., IoTFuzzer, Doona, Boofuzz, and Nemesys. In 5 out of 20 devices, Snipuzz successfully found 5 zero-day vulnerabilities, including null pointer exceptions, denial of service, and unknown crashes, and 3 of them could be exposed only by Snipuzz.

2 BACKGROUND

2.1 Fuzz Testing

Fuzzing is a powerful automatic testing tool to detect software vulnerabilities. After decades of development, fuzzing has been widely used as a base in several security testing domains, such as the OS kernel [12, 36], servers [33], and the blockchain [3]. In general, fuzzing feeds the target programs with numerous mutated inputs and monitors exceptions (e.g., crashes). If an execution reveals undesired behavior, a vulnerability could be detected. To discover vulnerabilities more effectively, fuzzing algorithms optimize the mutation process based on feedback of executions (e.g., coverage knowledge), instead of using a purely random mutation strategy. Moreover, fuzzers can judge from the feedback mechanism whether each test case generated by seed mutation is “interesting” (i.e., whether the test case has explored unseen execution states). If a test case is interesting, it will be reserved as a new seed to participate in future mutation. With the feedback, many fuzzers [4, 5, 29, 41, 49] steer the computing resources towards the interesting test cases and achieve higher possibility to discover vulnerabilities.

2.2 Generic Communication Architecture of IoT Devices

To react with external inputs, most IoT devices implement a similar high-level communication architecture. As per the pseudo code example presented in Figure 1, a typical implementation of the communication architecture may consist of four parts: 1) Sanitizer, 2) Function Switch, 3) Function Definitions, and 4) Replier. When an IoT device receives an external input, Sanitizer starts parsing the input and performs regular matching. If the input format breaches the syntactic requirements, or an exception occurs during the parsing process, Sanitizer will directly notify Replier by sending a response message describing the input error and terminate the processing of input. If the input is syntactically correct, Function Switch transfers control to the corresponding Functions according to the attribute, Key, and corresponding value, val, extracted from the input. If Key cannot be matched, the processing of this input will be terminated, similarly as done by Replier. When Functions completes the processing, such as setFlow(), with the parameter val, it notifies Replier to generate the response message. Note that, the implementation of Functions is specific to IoT devices. As described above, Replier is responsible for sending responses to the client (such as the user’s APP). Based on the calling situation (indicated by the parameter code in the example), Replier determines the content of response message to be sent.

3 MOTIVATION

3.1 Response-Based Feedback Mechanism

The interactive capabilities of IoT devices make it possible to test security of device firmware through the network. However, there are also some challenges when testing IoT devices using network-based fuzzers. Since most network fuzzing methods cannot directly obtain execution status of the device, it is hard to establish an effective feedback mechanism to guide the fuzzing process. Without feedback mechanism, the fuzzing tests could be blind in the selection of mutation targets, and may lean to a brute force random test. As discussed previously, due to the lack of open-sourced firmware, it is difficult or even impossible to instrument the IoT devices. Therefore, the response messages returned by the firmware can be regarded as a valuable source of device status information at run-time. The Replier in Figure 1 will use the value of the variable code to determine the content of the response messages. The value of code comes from many different function blocks in the firmware. Parameters are passed when Sanitizer fails to parse the input or some exceptions are triggered; or when the Function Switch cannot match the key command characters in the input; or after each input is executed in the Functions. Therefore, through the content of the response message, the code block that has been executed in the firmware can be inferred. When the firmware source code is not available, the correspondence between the firmware execution and the response messages cannot be directly extracted. Moreover, the firmware may return the same response messages even executing different functions. Although the response message cannot be equated to the execution path of the device, it can still play an important role in the black-box fuzz testing for IoT devices. Although it is hard to link the code execution path corresponding to each response message, if the two inputs get different response messages, we can deduce that the two inputs go to different firmware code execution paths.

Our approach.

Snipuzz uses the response message to establish a new feedback mechanism. Snipuzz will collect every response, and when a new response is found, the input corresponding to the response will be queued as a seed for subsequent mutation testing.

Figure 1: Interaction with IoT Firmware. Most implementations of IoT devices have a similar communication architecture, including Sanitizer,Function Switch, Function Definitions, and Replier. If the Sanitizer and the Function Switch perform correctly, corresponding functionalities will be executed. Except for crashes, the Replier will always send responses to clients.

3.2 Message Snippet Inference

The firmware of the IoT device can be regarded as a software program with strict syntax requirements for input. If the byte-basedmutation strategies (such as mutating each byte in the input one by one or randomly selecting bytes for mutation testing) are used in the fuzz testing, the generated test cases could be rare to meet the input syntax requirements. The grammar-based fuzzers utilize detailed documents or a large training data set to learn the grammatical rules and use it to guide the generation of mutation [34, 40]. In many cases, the input syntax in IoT devices is diverse or nonstandard. Table 1 shows the communication format requirements used in 20 IoT devices from different vendors. Some of them are using well-known formats such as JSON and SOAP, but some use Key-value pairs or even custom strings as communication format. Therefore, it is difficult to provide grammar specifications or establish training data sets that cover communication formats on a large scale for the grammar-based mutation strategy. The best grammar guidance originates from the firmware itself. Responses from IoT devices suggest the execution results of messages. If we mutate a valid message byte by byte (i.e., breaching the format), we will get many different responses. If mutation of two different positions in the valid message receives the same response, these two positions have a high possibility that they are related to the same functionality in firmware. Therefore, those consecutive bytes with the same response can be merged into one snippet. This method of inferring message snippets can clearly reflect the utility of each byte after entering the firmware. In addition, mutation based on message snippets can largely reduce the search space and improve the efficiency of fuzzing.

Our approach.

Snipuzz merges consecutive bytes with the same response into one snippet. We also propose different mutation operators performing on snippets.

4 METHODOLOGY

In order to clearly present our approach, we first introduce some notations while explaining the fuzzing process of Snipuzz. At a high level, Snipuzz performs as a client which sends a message sequence to request certain actions from IoT devices. Any message ∈ ?requests the IoT device to perform a certain functionality, and all the messages Ð = work together to request an action (or actions). Similarly to the typical fuzzers, we initialize a seed with an initial message sequence, and a seed corpus with all the seeds (Section 4.1). Meanwhile, restoring message sequences are collected for resetting the IoT device to a predefined status. To establish an effective fuzzing, as depicted in Figure 2, Snipuzz first conducts a snippet determination process. Concretely, Snipuzz selects a message in a seed ⊂ , from which a probe message and a corresponding sequence will be generated. Each message in will trigger a response message (response for short) containing the information about the execution output. Snipuzz assigns each message a response pool , which is utilized to determine if a new response is unique. The uniqueness of a response indicates that it does not belong to any category of responses existed in the response pool. If is unique, Snipuzz will add into the pool , and reserve the corresponding message sequence as a new seed. Snipuzz then divides the message into different snippets based on the responses (Section 4.2). Upon the snippets are obtained, Snipuzz performs mutation according to various strategies, e.g., empty, bytes flip, data boundary, or havoc (detailed in Section 4.3). Throughout the fuzzing process, Snipuzz sets up a network monitor to detect crashes which may indicate vulnerabilities (Section 4.4).

4.1 Message Sequence Acquisition

The quality of initial seeds could influence the fuzzing campaigns significantly. Therefore, we consider to obtain high-quality initial seeds conforming to highly-structured formats required by IoT devices, as such inputs may exercise complex execution paths and enlarge the opportunity of exposing vulnerabilities at deep code. Generating seeds based on companion app reverse-engineering [9] or accessible specifications (as mentioned in Section 3.2) could be intuitive solutions. However, they either require heavy engineeringefforts or could be error-prone (e.g., seeds may violate the required formats or have the wrong order of messages).

Initial seed acquisition. Snipuzz proposes a lightweight solution to obtain initial valid seeds. Considering that many IoT devices have first- or third-party API documents as well as the test suites, the testing programs provided by both parties can effectively act as a client, sending control commands to IoT devices or remote servers. Most structural information (e.g., header, message content) and protocols (e.g., HTTP, HNAP, MQTT) of communication packets are defined in the API programs as message payloads. Therefore, Snipuzz leverages these test suites to communicate with the target devices, while at the same time, extracting the message sequences as initial seeds. For example, when using an API program to turn on a light bulb, the program first sends login information to the server or to the IoT device, then sends a message to locate a specific light bulb device, and finally sends a message to control the device to turn on the light. Snipuzz captures such a message sequence that triggers a functionality of IoTdevice as an initial seed. Restoring message sequence acquisition. In order to replay a test case for the crash triage, Snipuzz ensures that the device under test has the same initial state in each round of testing. After sending any message sequence to the device, Snipuzz will send a restoring message sequence to reset the device to a predefined status. Manual efforts. Although we try our best efforts to provide a lightweight fuzzer, Snipuzz still requires some manual efforts to obtain valid and usable initial seeds. First, we manually configure the programs from the test suites, such as setting up the IP address and the login information. Note that, we only need to configure these programs once per device. Second, to capture the message sequences dynamically, we need to manually define the specific format and protocol in the network trafic monitor. Finally, we filter out some message sequences that will mislead the fuzzing process. For instance, some API programs provide operations that canautomaticallyupdateorrestartthedevice.Theseoperationswill halt the device and thus no response will be sent back. This leads to false-positive crashes because we consider a no-response execution as a crash. The manual work costs roughly 5 manhours per deviceand is only required during the message sequence acquisition phase of Snipuzz.

Figure 2: Workflow of Snipuzz. With the valid message sequences (seeds), Snipuzz performs snippet determination on each individual message. Then, Snipuzz mutates snippet(s) to generate new message sequences. By monitoring the network traffic, Snipuzz determines a crash when no responses are received.

4.2 Snippet Determination

The key idea of Snipuzz is to optimize fuzzing process based on snippets determined by responses. Put differently, Snipuzz lever-ages snippet mutation to reduce the search space of inputs, while the snippets are automatically clustered via categorizing responses from IoT devices. The major challenge is to correctly understand the semantics of responses. For instance, due to the presence of timestamp, two semantically identical responses will be classified into different categories if utilizing a simple string comparison. Therefore, Snipuzz utilizes a heuristic algorithm and a hierarchical clustering approach to determine the snippets in each message.

4.2.1 Initial Determination. The essence of a message snippet is the consecutive bytes in a message that enables the firmware to execute a specific code segment. For experienced experts, it is not difficult to segment message snippets according to the semantic definition in official documents. However, for algorithms that lack such knowledge, it is essential to apply some automatic approaches to identify the meaning of each byte in the message.Snipuzz first uses a heuristic algorithm to roughly divide each message into initial snippets. The core idea of the heuristic algorithm is to generate probe messages 𝑝𝑖 by deleting a certain byte in the message 𝑚 (𝑚 ∈ 𝑠𝑒𝑒𝑑 𝑆). By categorizing the responses 𝑟𝑖 of each probe message, Snipuzz preliminarily determines the snippets in the message 𝑚.

For example, as shown in Table 2, to determine snippets in the

message 𝑚 = {"on":true}, Snipuzz generates probe messages by removing the bytes in 𝑚 one by one. When the first byte ‘{’ in 𝑚 is deleted, the corresponding probe message 𝑝1 is "on":true}. Similarly,when the second byte is deleted, the corresponding probe message𝑝2 is {on":true}. Therefore, the message 𝑚 with 11 bytes can generate 11 different probe messages (𝑝1 to 𝑝11). Snipuzz will send the11 corresponding message sequences (𝑀1 to 𝑀11) containing the probe messages to the device and collect responses.

Snipuzz then distinguishes the snippets in the message 𝑚 by categorizing the responses. Specifically, the consecutive bytes with the same corresponding response type are merged into the same snippet. According to the examples illustrated in Table 2, the Response 𝑟1, 𝑟2, and 𝑟5 are merged into one category that indicates an error in JSON syntax, while Response 𝑟3 and 𝑟4 are merged into anothercategory which indicates an error of an invalid input parameter.Therefore, the consecutive bytes whose corresponding responses belong to the same category can form a message snippet. Through this heuristic approach, Snipuzz can determine all initial snippets in the message 𝑚.

A naive method to categorize responses is to utilize a string comparison, i.e., comparing the content of responses byte by byte.However, due to the existence of randomness in responses (e.g.,timestamp and token), a simple string comparison may incorrectly distinguish the responses with same semantic meaning into different categories. Therefore, a more advanced solution, Edit Distance [42], is introduced to determine the category of responses.As shown in Equation (1), a similarity score, 𝑠𝑘𝑡, between two responses 𝑟𝑘 and 𝑟𝑡 is calculated.

where the 𝑚𝑎𝑥𝑙𝑒𝑛() in the equation selects the longer string between the two responses and the 𝑒𝑑𝑖𝑡𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 () counts the minimum number of operations, including insertion, deletion, and substitution, required to transform one string into the other. Therefore,the more similar two responses are, the larger the value of 𝑠𝑘𝑡 is.Snipuzz first calculates a self-similarity score 𝑠𝑖𝑖 for each probe message 𝑝𝑖 . Note that 𝑝𝑖 is generated by mutating the 𝑖-th byte in the message 𝑚. Concretely, Snipuzz sends the same probe message 𝑝𝑖 twice within an interval of one second. Two responses 𝑟𝑖 , 𝑟𝑖′ will be collected from the IoT device, correspondingly. The selfsimilarity score 𝑠𝑖𝑖 is then calculated based on the two responses 𝑟𝑖 , 𝑟𝑖′ according to Equation (1). Note that, due to the randomness in the responses, there could be differences between the two responses 𝑟𝑖 , 𝑟𝑖′ , even though they are from the same probe message. Therefore, the self-similarity score could be smaller than 1. To determine whether two responses belong to the same category, Snipuzz computes the similarity score of two responses and compares it with the self-similarity score. For example, for two responses 𝑟𝑖 and 𝑟𝑗 , Snipuzz uses the Equation (1) to compute the similarity score 𝑠𝑖𝑗 . After that, 𝑠𝑖𝑗 will be compared with the selfsimilarity. If 𝑠𝑖𝑗 >= 𝑠𝑖𝑖 or 𝑠𝑖𝑗 >= 𝑠𝑗 𝑗 satisfies, responses 𝑟𝑖 and𝑟𝑗 will be considered belonging to the same category; otherwise,responses 𝑟𝑖 and 𝑟𝑗 are then assigned to the different categories.For a newly received response 𝑟𝑖, Snipuzz will compare it with all the responses in the corresponding response pool 𝑅 based on the similarity score. If the new response 𝑟𝑖 does not belong to any existing category, the response 𝑟𝑖 as well as the corresponding probe message 𝑝𝑖 will be added into the Response Pool. With the response pool 𝑅, Snipuzz categories each byte in the message 𝑚. Specifically, the category of the 𝑖-th byte in message 𝑚 is assigned according to the category of response 𝑟𝑖 . Then the consecutive bytes with the same category will be merged into one snippet. Figure 3 shows an example of the initial snippet determination on the message 𝑚 = {"on":true} according to the response categories in Table 2.

Figure 3: An example of snippet determination.

4.2.2 Hierarchical Clustering. Although Snipuzz utilizes similarity comparison to mitigate the mis-categorization caused by randomness in responses, two semantically identical responses may still be mis-categorized into different categories. This could occur when the responses contain contents extracted or copied from probe messages. For example, due to the quotation of specific error contents from probe messages, the heuristic algorithm will not assign them to one category. Specifically, the similarity score 𝑠34 of 𝑚 = {"on":true} in Table 2 is 0.979, which is smaller than the self-similarity scores 𝑠33 = 1*.000 and 𝑠44 = 1.*000 (as there is no randomness in the responses). However, these two responses are semantically identical and should be identified into one category, i.e., they are both error messages, indicating parameter syntax errors are located in the probe messages and the device is executing the same code block. In order to solve the aforementioned problem, Snipuzz uses agglomerative hierarchical clusters to refine message snippets. The core idea of hierarchical clustering is to continuously merge the two most similar clusters until only one cluster remains. As shown in Algorithm 1, Snipuzz will initialize the snippets according to Initial Snippets determined in Section 4.2.1 (line 1). After that, each response category in the response pool will be initialized as a cluster (line 2). Snipuzz will convert the responses into feature vectors (line 3, detailed in the later paragraph) which will be used to compute the distance between each pair of clusters (lines 5-7). Then the two closest clusters will be merged and thecluster center will be updated accordingly (lines 8-10). After performing the cluster process, Snipuzz will generate new snippets according to the current cluster result and add the new snippets into the snippet segmentation result (line 11), which will be further used for mutation.

Concretely, Snipuzz first extracts features from responses, which vectorize responses into tuples of the self-similarity score, the length of the response, the number of alphabetic segments, the number of numeric segments, and the number of symbol segments. Each segment consists of consecutive bytes that have the same type. For instance, “123” is 1 numeric segment, and there are 2 alphabetic segments and 1 numeric segment in “𝑎1𝑏”. More specifically, the 𝑟1 in Table 2 will be vectorized to 𝑣1 = (1*,* 91*,* 10*,* 2*,* 10). Similarly, responses 𝑟3 and 𝑟4 will be converted to 𝑣2 = (1*,* 94*,* 11*,* 2*,* 13) and 𝑣3 = (1*,* 94*,* 11*,* 2*,* 13). Figure 4 shows an example of clustering according to the message 𝑚 = {"on":true} in Table 2. According to the Algorithm 1, in the preparation round (0th round) of clustering, each category in the response pool will be initialized a single cluster. In the 1st round, as clusters 2 and 3 are the two clusters with minimum distance (∥𝑣2 − 𝑣3 ∥ = 0), the two clusters are merged into a new cluster. Correspondingly, the message snippets ‘o’ and ‘n’ are merged into a new snippet, marked with index #4. Similarly, in the next round, the two closest clusters, the cluster 1 and the new cluster, are merged, and a new snippet will also be generated. Finally, all snippets in the message are merged into one new snippet, i.e., the message itself. All the new generated snippets together with the initial snippets will be used in message mutation in the next stage.

4.3 Mutation Schemes

Snippet Mutation. In order to conduct an efficient fuzzing, Snipuzz mutates the snippets obtained in the stage of Snippet Determination. Note that the mutation schemes are performed on the entire snippet instead of a single byte in a message.

• Empty. The empty of a data domain may crash the firmware if the data domain is not properly checked. Therefore, Snipuzz deletes an entire snippet to empty the data domain.

Figure 4: An example of hierarchical clustering.

• Byte Flip. To detect bugs in both the syntax parsers and the functional code, Snipuzz flips all bytes in a snippet. This changes the syntactic meaning of strings and will discover bugs when the parser does not properly check syntax. On the other hand, Byte Flip changes the values of data domains to examine firmware.

• Data Boundary. To detect the out-of-bound bugs that occur during assignment, Snipuzz modifies the values of numeric data to some boundary values (e.g., 65536).

• Dictionary. For the scheme of Dictionary, Snipuzz replaces a snippet with a pre-defined string such as “true” and “false”, which may directly explore more code coverage.

• Repeat. In order to detect bugs in syntax parsers, Snipuzz repeats a snippet for multiple times. Meanwhile, the repetition of data domain can detect defects caused by out-of-boundary problems. Havoc. The conditions for triggering bugs may be complicated.

For example, it may require modifying different data domains in the same message to trigger a bug. The aforementioned snippetmutation schemes only mutate one snippet at a time. However, the havoc mutation randomly selects some random snippets in a message, and performs the aforementioned mutation schemes on each of the selected snippets. Havoc mutation will not stop until finding a new response category or the target IoT device crashes.

4.4 Network Traffic Monitor

The network communication of the device is monitored and a timeout is set to determine whether the device has been crashed. In fact,the monitoring of device network communication is not a single step, and it occurs during the entire fuzzing process. In case of timeout, Snipuzz will continue to send the same message sequence for three times, as the cause of timeout could be network fluctuations instead of device crashes. If the timeout occurs for three times,Snipuzz will use the control command to physically restart the device and send the same sequence of messages to the device again. If the device still does not return the message on time, Snipuzz will record the crash and the corresponding message sequence.

4.5 Implementation

The design of Snipuzz consists of four steps: Message Sequence Acquisition, Snippet Determination, Mutation, and Network Communication Monitoring. In the Message Sequence Acquisition step, we use WireShark [45] in the program to detect and record the communication packets between the API and the IoT device, and manually cleaned these message sequences. The remaining core functional steps are packaged in a prototype implementedwith 4,000 lines of C# code. The network monitor will record every message sent to the device, and send the information to the device again when the device does not reply. A smart plug was used to implement the physical restart function of the target device. When Snipuzz needs to physically restart the device under test, it will send control messages to the smart plug, and the plug will be closed and then opened. In this way, the device under test will be powered off briefly and restarted.

5 EXPERIMENTAL EVALUATION

5.1 Experiment Setup

Environment setup. To initialize IoT devices, we use the applications provided by the manufacturers to complete the pairing. In order to better monitor the network communication, all devices under test are connected to a local router. Our automatic packet extractor and Snipuzz run on a Windows 10 desktop PC with Intel Core i7 six-core x 3.70 GHz CPU and 16 GB RAM, which is also connected to the router. IoT Devices under test. We have selected 20 popular consumer IoT devices from both online and offline markets worldwide, covering various well-known brands, such as Philips, Xiaomi, TP-Link, and Netgear. The types of selected IoT devices include smart plugs, smart bulbs, routers, home bridge, IP camera, and fingerprint terminal. These devices are either recommended by Amazon or the best-selling products available in supermarkets. Table 1 details the information of the IoT devices under test. Benchmark tools. In order to verify Snipuzz’s performance in finding crashes and message segmentation, we used seven different fuzzing schemes as benchmarks.

• IoTFuzzer [9]. The core idea of IotFuzzer is to find the functions that send control commands to the IoT device by static analysis on companion apps, and to mutate the value of specific variables to perform fuzzing test without breaking the message format. Note that we try our best efforts to replicate IoTFuzzer since its source code is not publicly available, and we acknowledge that this could provide slightly different results with respect to the original version. We implement the IoTFuzzer by replacing the mutation algorithm in Snipuzz framework with the mutation strategies in IoTFuzzer. Considering that the purpose of companion apps analysis in IoTFuzzer is to ensure that only the data domain in the communication message is mutated, to make the benchmark as fair as possible, we use seeds same as the ones used in Snipuzz and manually segment the data domain of each seed message before feeding it to IoTFuzzer. We believe that such manual segmentation is sufficient to provide an upper bound performance of IoTFuzzer. Note that we remove the methods that are related to the feedback mechanism and snippet segmentation because these methods are not used in IoTFuzzer.

• Nemesys [22]. Nemesys is a protocol reverse engineering tool for network message analysis. It utilizes the distribution of value changes in a single message to infer the boundaries of each data domain. Considering that Nemesys is a protocol inference method instead of an off-the-shelf fuzzing tool, we implement the methodof Nemesys based on the Snipuzz framework to infer the snippet boundary, replacing corresponding snippet determination method (Section 4.2).

• BooFuzz [31]. As a successor of Sulley [19], BooFuzz is an excellent network protocol fuzzer that has been involved in several recent fuzzing research [9, 37, 48]. Different from other automatic fuzzers, BooFuzz requires human-guided message segmentation strategies as inputs. In our research, we leverage this property and manually define more fuzzing strategies to enrich the benchmark evaluation.

– BooFuzz-Default. In this strategy, we set each message in the input as a complete string, that is, BooFuzz will use the message as a string for mutation testing.

– BooFuzz-Byte. Each byte of the message in the input will be used for a mutation test individually.

– BooFuzz-Reversal. Contrary to the idea of IoTFuzzer, in this strategy, we focus on the mutation of non-data domain in the message, while keeping data domain unchanged.

• Doona [44]. Doona is a fork of the Bruterforce Exploit Detector (BED) [6], which is designed to detect potential vulnerabilities related to buffer and formats in network protocol. Different from other tools, Doona does not take network communication packets as seeds. The test cases of Doona are required to be pre-defined for each device or protocol under test.

• Snipuzz-NoSnippet. Snipuzz uses the segmentation of message snippets to enhance the fuzzing efficiency and the ability to find crashes. In order to verify whether the snippet determination indeed benefits fuzzing, we implement Snipuzz-NoSnippet based on Snipuzz. Snipuzz-NoSnippet does not have the snippet determination component, and blindly mutates bytes in messages without the knowledge of responses.

Except for Doona whose test cases are preset, all benchmark tools and Snipuzz are tested with same input sets. These input sets may be different in formats (e.g., BooFuzz requires to manually set the input, and Numesys requires the input to in pcap file format), but the content is the same. There are many other popular fuzzing tools which are able to test IoT devices via network communication, such as Peach [30] and AFLNET [33]. However, since they are grey-box fuzzers that requires to instrument firmware, it is infeasible and unfair to regard those tools as baselines for black-box schemes.

5.2 Vulnerability Identification

5.2.1 Snipuzz. After performing fuzz testing using Snipuzz on each of the 20 IoT devices for 24 hours, we detected 13 crashes in 5 devices as shown in Table 3. We further manually verified that the detected crashes include 7 null pointer dereferences, 1 denial of service, and 5 unknown crashes. The crashes found by Snipuzz are triggered by the malformed inputs. These malformed inputs break the message format in different ways. For example, deleting placeholders, emptying the data domain, or fortunately changing the type of data value.Note that all the crashes identified by Snipuzz are on JSON based devices, although we successfully conducted experiments on the 20 IoT devices with various communication formats, such as JSON, SOAP, and K-V pair. The experiments also show that Snipuzz discovers a higher number of response categories compared to the other fuzzers (as detailed in Section 5.3).

Null pointer dereferences. As shown in Table 3, the 7 crashes triggered by Snipuzz in TP-Link HS110 and HS100 are all caused by null pointer dereferences. After sending the test cases to HS110 and HS100, the devices crashed, unable to reply to any interaction. However, after a few minutes, the devices automatically restarted and recovered to the initial state. Based on the analysis of test cases, we found that the vulnerabilities are all triggered by messages that mutated in JSON syntax. Concretely, when some important place holders, such as curly braces and colons, or a part of the test message are mutated, the syntax structure and the semantic meaning of the message are broken. If the device cannot handle the mutated input message properly, it will crash the device. We reported the vulnerabilities to the device vendor, TP-Link, via email on June 13, 2020. They have confirmed the vulnerability and promised to fix it through a firmware update. Denial of service. Another interesting finding is the denial of service vulnerability detected in Philips A60 smart bulb. After being tested by Snipuzz for 24 hours, Philips’ official companion app could not manage the device normally. Specifically, the device cannot be found in the app and if any further messages are sent through the app, the response in the app will keep asking to bound the device to a device group and no further interaction is available. However, we observe that if the message packet is sent directly to the device, the device can work normally. This indicates that the device does not completely crash but its service via the companion app is denied. Unknown crashes. Snipuzz found 5 crashes on Yeelight bulbs, YLDP05YL, and YLDP13YL. The devices crashed and then restarted by themselves within roughly 1 minute. By analyzing the test cases, we found that the crashes are due to the deletion of certain data domains, such as the nullify of parameters (marked as red in Table 4). As the firmware of the two devices is not publicly available, the root cause of the vulnerability cannot be determined; however, we can still deduce that the vulnerability is due to the device reading in null values during the parsing process, causing a crash during the assignment. We also find that its communication using a local network does not require any authentication, which means that the device can be crashed by any attackers in the local network. Therefore, we consider the vulnerabilities as ‘remotely exploitable’.

5.2.2 Benchmark with state-of-the-art tools. As shown in Table 3,after 24 hours fuzz testing on each device, none of the benchmark tools found a crash, except IoTFuzzer. They did not find the crash due to various reasons. Donna focuses more on the mutation of communication protocols. Further, Donna cannot be applied on all devices, which limits its capacity. Since Boofuzz directly replaces the specified positions in the message with a preset string, it can only trigger limited types of vulnerabilities. Nemesys offers a new idea of determining message snippets. However, since it determines message snippets by the distribution of values in messages, it is difficult for Nemesys to accurately decide the boundary between data and non-data domains. Therefore, Nemesys can hardly detect vulnerabilities that can only be triggered by mutating the data or non-data domains. Snipuzz-NoSnippet, which does not apply the snippet-based mutation method used in Snipuzz, is similar to the classic fuzzer AFL [24]. Since Snipuzz-NoSnippet does not infer the structure of the message but directly uses single or multiple consecutive bytes as the unit of mutation, most test cases generated by Snipuzz-NoSnippet destroy the structure of the messages. Such a method is difficult to work on devices that require highly-structured inputs.

IoTFuzzer detected 2 crashes in 2 smart bulb devices, i.e., the YLDP05Y and YLDP013Y. Due to the mutation strategy of IoTFuzzer, the malformed input provided by IoTFuzzer is obtained by emptying the data domain. According to the examples of mutatedmessages listed in Table 4, we can see that the messages mutated by IoTFuzzer resemble the ones generated by Snipuzz. The mutated domains of messages from Snipuzz and IoTFuzzer in Table 4 are all in the data domain. In terms of the mutation test effectiveness, Snipuzz and IoTFuzzer achieve the same goal on these two examples. However, Snipuzz can cover the mutation space of IoTFuzzer because IoTFuzzer only focuses on the data domain mutation, while Snipuzz can mutate both the data and non-data domains.

To further determine the root cause of the crash, we obtained the firmware source code of HS100 and HS110, two typical market consumer-grade smart plugs manufactured by TP-Link, and conducted a case study which reflected the differences between Snipuzz and IoTFuzzer. We found that one of the crashes triggered by Snipuzz on the two devices is caused by breaking the syntax structure and mutating both on data and non-data domains. More specifically, the mutated messages successfully bypassed the sanitizer and triggered the crash during function execution. We deduce that this could be caused by an error-prone third-party sanitizer (more details could be found in Section 5.5). On the other hand, due to the design of IoTFuzzer, the fuzzing is based on the grammatical rules as the IoTFuzzer tends to satisfy the grammar requirements with first-priority, in order not to be rejected by the sanitizer and ensure that each test case can reach the functional execution part in the firmware. Such strategy constraints the test range of fuzzing and its capacity to cover the sanitization part in comparison to Snipuzz. Therefore, we argue that, considering the complexity of IoT firmware testing, a lightweight and effective black-box vulnerability detection tool, such as Snipuzz, is a pressing need.

5.3 Runtime Performance

Figure 5 shows how Snipuzz and the other 7 benchmark fuzzers explored the device firmware during the first 10 minutes. We repeated the fuzz testing for 10 times and recorded the median values of the numbers of response categories discovered by each method. We manually reviewed the response categories to remove the mis-categorization caused by randomness in responses or the response mechanism of devices.

As shown in Figure 5, Doona can only detect a small number of response categories. Doona is protocol-based fuzzing methods, and its tests are more biased towards protocol content. The mutation test on the communication protocol has a high probability of being directly rejected or ignored by the device unilaterally, resulting in few categories of responses that can be pass the sanitizer. We implemented 3 fuzzing strategies based on Boofuzz, i.e., mutating the whole message as a string, mutating each byte of the message, and mutating non-data domain. However, the testing results indicate that all of them explored very limited categories of responses on each device. The limitation of category coverage is due to the mutation strategy of Boofuzz, which replaces the target contents with a specific pre-defined string. For example, using strings, such as “/./././././././.”, to replace the content of messages with different strategies (e.g., replacing the entire strings, a single byte, or a non-data domain) may cause the violation of message format and could be easily rejected by the sanitizer. Therefore, most of the responses obtained by Boofuzz fall into the category of “error responses”.

The number of response categories explored by IotFuzzer grows rapidly within a short period of time and then stagnates. In the mutation stage, IotFuzzer randomly selects a set of inputs from the original candidate inputs and randomly mutates the data domain for one or more message(s). It will continue to repeat this method until the device crashes or reaches the time limit. Such a method based on randomness helps IotFuzzer to mutate and test a large number of message data domains in the original input and collect response message categories quickly in the beginning. However, the number of response categories found by IotFuzzer will soon reaches the limitation as it focuses on data domain mutation only. In most devices, Snipuzz has maintained a steady upward trend in most cases, and after a period of time, the number of response categories found by Snipuzz exceeds IotFuzzer. Unlike IotFuzzer, Snipuzz covers response categories through the Snippet Determination stage. Based on the message snippet exploration strategy, Snipuzz first explores all the response categories of a certain message as many as possible, then the snippets of a message are obtained and tested by Snippet Mutation. The next message will be processed in the same way until all messages in the initial message sequence have been tested. Followed by this method, Snipuzz may not get a large number of response categories at the very beginning. When Snipuzz detects a message snippet, every byte in the message content will be included in the test. Therefore, as shown by the bold numbers in Table 3, for 15 out of 20 devices, Snipuzz covers the most number of response categories after 24-hour fuzz testing, compared with other state-of-the-art IoT fuzzing tools. On 5 devices, Snipuzz-NoSnippet collected more response categories than Snipuzz within 24 hours. The mutation method used by Snipuzz-NoSnippet is similar to the classic fuzzer AFL [24]. It directly performs mutation on a single byte or several consecutive bytes. However, Snipuzz-NoSnippet is difficult to cover response categories that are not obtained by breaking the grammatical format (e.g., data out of bounds in the data domain). Theoretically, although the Snipuzz-NoSnippet mutation method is not so efficient, it still has the capability to explore the most categories of responses. Nemesys explores more categories of responses than BooFuzz and Doona, but does not exceed Snipuzz. The Nemesys strategy performs deterministic mutations on each data domain of the messages in turn, which makes its trend of run-time performance similar to Snipuzz. However, the data domain determination strategy of Nemesys is not based on the responses from IoT device. Thus, the distribution of byte values in messages does not benefit in covering more response categories. Therefore, the number of response categories collected by Nemesys is limited. It is interesting to observe that, in the case of R6400, Snipuzz also enters a stagnation after only finding a few response categories. We carefully checked the initial input message sequences and found that the average length of the message exceeds 400 bytes, forcing Snipuzz to generate and send a large number of probe messages to determine message snippets. As a result, in the first 10 minutes, Snipuzz was still exploring the response category of the first few messages, which limited its performance.

Figure 5: Runtime performance. The number of categories discovered in 10 minutes on all the 20 IoT devices. Snipuzz performed best on 19 devices.

5.4 Assessment on Message Snippet Inference

Among all strategies, Snipuzz and Nemesys utilize semantic segmentation, to assess their message snippet inference performance. We compare the snippets they produce during the fuzzing process with the grammar rules defined in API documents. Specifically, for some mature and popular languages, such as JSON, we establish the grammar rules as per their standard syntax; for custom formats, such as strings or custom bytes, we refer to the official API documents and define the grammar rules based on the instructions. Equation (2) quantifies the quality of snippet inference, and Similarity indicates the percentage of correctly categorized bytes in a snippet determined message, 𝑚, compared with the ground truth, 𝑔, manually extracted from the grammar rules.

where 𝑐𝑎𝑡𝑒 () returns the category of each message byte in a series of “0” and “1” bits, 𝑐𝑜𝑢𝑛𝑡() counts the number of mis-categorized bytes, and 𝑙𝑒𝑛() represents the length of a message. Note that in a ground truth message, “0” indicates the non-data domain (marked blue in Table 5), while “1” indicates the data domain (marked red in Table 5). Therefore, the ⊕ is the bitwise 𝑋𝑂𝑅 operation. In addition, followed by Equation (2), we compute the average similarity of the snippets (or data domain) determined by Snipuzz and Nemesys for all the 235 messages obtained from experiments. Note that during the calculation of the average similarity, for each message, if there are multiple snippet sets determined, we will select the snippet inference with the highest similarity value; therefore we present an upper-bound of performance.

The average similarity result of Snipuzz, 87.1%, indicates that, by applying snippet inference based on the hierarchical clustering approach, Snipuzz can effectively find the grammatical rules hidden in the message. Ideally, in Snipuzz, the merging of clusters removes the influence caused by the randomness in responses and by the replying message mechanism itself. Therefore, the message snippets will conform to the grammatical rules gradually, which leads Snipuzz to a higher similarity result.However, we also found some differences between the snippet inference method and the grammatical rules in some results. For example, given the example shown in Table 5, the snippet inference method combines the strings belonging to the data domain in the grammatical rules (i.e., ‘true’, ‘140’ and ‘254’) with some placeholders (such as double quotes and curly brackets). After analyzing the response messages, we found that the responses obtained after destroying these data domains and destroying placeholders are all about invalid format. This may due to the fact that in the firmware, when an error occurs in the parsing format, the response does not report a detailed description of the error but instead returns a general format error.

Figure 6: An example of vulnerability triggering

On the other hand,Nemesys uses the distribution of value changes in the protocol to determine the boundary of different data domains and achieve the semantic segmentation of a message. The advantage of this method is that it does not require any other additional information, such as grammar rules or a large number of training data sets in addition to the message itself. The average similarity result of Nemesys, 64.5%, is lower than the Snipuzz result. Given the example shown in Table 5, when segmenting messages in a format requires restricted syntax, such as Json and XML, Nemesys can achieve a good semantic segmentation performance, because the placeholders usually use symbols unusually used in data domains. This distribution of byte value enables Nemesys to effectively find the boundaries between data domains. However, in IoT devices, customized formats are prevalent. For example, the smart bulb BR30 uses custom bytes as a means of communication, where each byte corresponds to a special meaning (i.e., “0x61” represents “CHANGE_MODE” and “0x0f” represents “TRUE”). In such cases, the value distribution of characters can no longer be used as a guidance for the data domain determination, and thus the message segmentation determined by Nemesys is error-prone.

5.5 Mutation Effectiveness: A Case Study The HS100 and HS110 manufactured by TP-Link are two classic market consumer-grade smart plugs. In the work by Chen et al. [9], they use HS110 with firmware version 1.3.1 to test IoTFuzzer. The results of their experiment show that IoTFuzzer triggered a vulnerability in the device by mutating the data domain in a message (changing “light” to 0). However, in the updated version of the firmware (version 1.5.2), IoTFuzzer did not find any vulnerabilities. Figure 6 shows an example of the original input message and the mutated snippets in a message that can trigger the vulnerability. In this case, Snipuzz triggered a vulnerability related to firmware input by breaking the JSON syntax structure in the message. The intention of the original message is to change some attributes (e.g., “stime_opt” and “wday”) in a rule (inferred by “edit_rule”). In the mutated message, Snipuzz randomly deleted some contents (inside the red frame), which break the JSON syntax. This may cause errors about parsing messages or passing parameters incorrectly handled by the firmware and, consequently, crashes the device.

Figure 7: A vulnerable code snippet from HS110 firmware.

To further determine the root cause of the crash, we obtainedthe firmware source code. Figure 7 shows a code snippet from the firmware, using cJSON,3 a popular open-source lightweight JSON parser (5.4k stars in GitHub), to interpret input message fragments. The jalr instruction will save the result of cJson_GetObjectItem in $t9 and jump to this address unconditionally (see line 3 in Figure 7), which means the firmware will pick the value corresponding to “schedule”. In the original message, the value corresponding to “schedule” is a JSON object headed by “edit_rule” (from line 4 to line 16 in Figure 6). Note that the aforementioned snippet-based mutation strategy implemented in Snipuzz is able to break the syntax structure and mutate on data and non-data domains at the same time. Interestingly, although the removing of two left curly braces breaks the JSON syntax, it is not recognized by cJSON parser, so the mutated message successfully bypasses the syntax validation and enters the functional code in firmware. When the firmware tries to access the successor JSON object in “schedule”, i.e., the object starts with “edit_rule”, since the corresponding value is no more a JSON object, but an array, a null pointer exception is triggered. Due to the design of IoTFuzzer, the fuzzing based on grammatical rules will offer priority to satisfying the grammar requirements in the mutation process in order not to be rejected by the firmware grammar detector. The advantage of this is to ensure that each test case can reach the functional execution part of the firmware. However, in this case, the test range of fuzzing based on grammatical rules cannot cover the firmware sanitising part. To conclude, the root cause of the crash has two factors: 1) the validation of message syntax heavily relies on a third-party library; 2) the firmware does not correctly handle the null pointer exception caused by data type mismatch. Although it is not reasonable to require a vendor to develop products purely from scratch, we argue that thorough testing and validation on the open-source library are essential. Considering the complexity of IoT firmware testing, a lightweight and effective black-box vulnerability detection tool, such as Snipuzz, is a pressing need.

6 DISCUSSION AND LIMITATIONS

Snipuzz has successfully examined 20 different devices and exposed security vulnerabilities on five of them. However, there are still some factors limit the efficiency and scalability of Snipuzz. Scalability and manual effort. In our prototype, we capture communication packets through API programs and monitoring network communication. Even in the absence of API programs or documents, the message formats can be determined from the official Apps of IoT devices through decompilation and taint analysis. Otherwise, we can solve this problem by intercepting the communication between APPs and IoT devices, and then recovering message formatsfrom the captured packets. However, both methods could introduce overhead and involve manual effort. Recall that Snipuzz requires 5 man-hours per device to collect the initial seeds (Section 4.1. The manual efforts mainly focus on cleaning the packets from the API programs that are obtained from publicly available first- and third party resources. To mitigate, when applying Snipuzz to IoT devices, techniques such as crawlers could be used to automatically gather API programs. Moreover, the process of cleaning the packets could also be improved by pre-processing keywords through scripts to achieve automatic collection of communication packages. Threats to validity. As Snipuzz collects initial message sequences via API programs and network sniffers, the first threat comes from the absence of API programs. In this case, we can recover message formats based on the companion apps of IoT devices (similar to IoTFuzzer), but may lead to more manual efforts. Second, the encryption in messages decreases the effectiveness of snippet determination because the semantic information could be corrupted. A potential solution to the encryption issue is to integrate decryption modules into Snipuzz. Finally, the code coverage of firmware could be subject to the accessibility of API programs, since Snipuzz can only examine the functionalities that are covered in API programs. Recombining the message snippets from different seeds to generate new valid inputs could mitigate this limitation. Code coverage. The code coverage of firmware explored by Snipuzz depends on the API programs. For example, if the API programs of a bulb only support the functionality of turning on power, it is almost impossible to explore the functionality of adjusting the brightness via mutating the messages captured during the power turned on. In the future work, without the support of grammar, we will consider recombining the message snippets to try to generate new valid inputs. This method can help explore more firmware execution coverage in addition to the original inputs provided. Requirements on detailed responses. The detection effectiveness of Snipuzz depends on the quality of message snippets which is contingent on how much information could be obtained from the responses of IoT devices. To put differently, if the IoT device does not provide responses that are detailed enough, for example reporting all the errors with a uniform message, it could be hard for Snipuzz to determine the message snippets. Fortunately, in many IoT devices, advanced error descriptions could be obtained in debug mode which will significantly improve the determination process of message snippets in Snipuzz.

7 RELATED WORK

Our Snipuzz performs in a black-box manner for detecting vulnerabilities in IoT devices. Unlike existing black-box fuzzing for IoT devices, which blindly mutates messages, Snipuzz optimizes the mutation process of black-box fuzzing via utilizing responses. This feedback mechanism improves the effectiveness of bug discovery. For instance, IoTFuzzer [9] obtains the data domain, on which IoTFuzzer performs blind mutation. Thus, IoTFuzzer lacks the knowledge of the quality of the generated inputs, resulting in a waste of resource on the low-quality inputs. There are also several dynamic analysis approaches focusing on the networking modules of IoT devices. For example, SPFuzz defines a new language for describing protocol specifications, protocol state transitions, and their correlations [37]. SPFuzz can ensure the correctness of the message format in the conversation state and the dependence of the protocol. IoTHunter is a grey-box approach to fuzz the state protocol of IoT firmware [47]. IoTHunter can constantly switch the protocol state to perform a feedback-based exploration of IoT devices. In a recent example, AFLnet acts as a client and continuously replays the variation of the original message sequence sent to target (i.e., server or device) [33]. AFLnet uses response codes, which are the numbers indicating the execution states, to identify the execution status of targets and explore more regions of their networking modules. Another research line for dynamic analysis of IoT devices is the usage of emulators. The disadvantages of emulation are the heavy engineering efforts and the requisite of firmware, although the emulation of IoT firmware can analyze more thoroughly than black-box fuzzing. Two major challenges for emulation of IoT firmware are the scalability and throughput. Therefore, the efforts in improving the performance of emulation include full-system emulation [8, 27], improvement of emulation success rates [21], hardware-independent emulation [17, 38], and combination of user- and system-mode emulation [51]. Based on the emulation, fuzzing can be integrated into those frameworks and can hunter defects in firmware [38, 51]. Static analysis of firmware is the complementary approach of dynamic analysis. Semantic similarity is one of the major techniques that make static analysis successful. Researchers analyze semantic similarity via comparison of files and modules [13], Control Flow Graphs (CFGs) [14], parser and complex processing logic [11], and multi-binary interactions [35]. There are also many similaritybased approaches that can detect vulnerabilities across different firmware architectures. They usually extract various architectureindependent features from firmware for each node in a CFG to represent a function, and then check whether two functions’ CFG representations are similar [15, 32].

8 CONCLUSION

In this paper we have presented a black-box fuzzing framework Snipuzz designed for detecting vulnerabilities hiding in IoT devices. Different from other black-box network fuzz testing, Snipuzz uses the response messages returned by the device to establish a feedback mechanism for guiding the fuzzing mutation process. In addition, Snipuzz infers the grammatical role of each byte in the messages based on the responses from the device, so that Snipuzz can generate test cases that meet the device’s grammar without the guidance of grammatical rules. We have used 20 consumer-grade IoT devices from the market to test Snipuzz, and it has successfully found 5 zero-day vulnerabilities on 5 different devices. ACKNOWLEDGEMENTS We thank all the anonymous reviewers for their valuable feedback. Minhui Xue was, in part, supported by the Australian Research Council (ARC) Discovery Project (DP210102670) and the Research Center for Cyber Security at Tel Aviv University established by the State of Israel, the Prime Minister’s Office and Tel Aviv University. In addition, this research is partially supported by the Australian Research Council projects DP200100886 and LP180100170.

REFERENCES

[1] 2020. The Three Software Stacks Required for IoT Architectures. IoT Eclipse

(White Paper) (2020).

[2] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig,

Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for deep

bugs with grammars.. In The Network and Distributed System Security Symposium

(NDSS).

[3] I. Ashraf, X. Ma, B. Jiang, and W. K. Chan. 2020. GasFuzzer: Fuzzing ethereum

smart contract binaries to expose gas-oriented exception security vulnerabilities.

IEEE Access (2020).

[4] Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury.

2017. Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference

on Computer and Communications Security.

[5] Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2016. Coverage based greybox fuzzing as markov chain. In Proceedings of the 2016 ACM SIGSAC

Conference on Computer and Communications Security.

[6] Kali Bot. 2019. bed. https://gitlab.com/kalilinux/packages/bed.

[7] Z. Berkay Celik, Patrick McDaniel, and Gang Tan. 2018. Soteria: Automated

IoT safety and security analysis. In 2018 USENIX Annual Technical Conference

(USENIX ATC 18).

[8] Daming D Chen, Maverick Woo, David Brumley, and Manuel Egele. 2016. Towards

automated dynamic analysis for Linux-based embedded firmware. In The Network

and Distributed System Security Symposium (NDSS).

[9] Jiongyi Chen, Wenrui Diao, Qingchuan Zhao, Chaoshun Zuo, Zhiqiang Lin,

XiaoFeng Wang, Wing Cheong Lau, Menghan Sun, Ronghai Yang, and Kehuan

Zhang. 2018. IOTFUZZER: Discovering memory corruptions in IoT through

app-based fuzzing. In The Network and Distributed System Security Symposium

(NDSS).

[10] Abraham Clements, Eric Gustafson, Tobias Scharnowski, Paul Grosen, David

Fritz, Christopher Kruegel, Giovanni Vigna, Saurabh Bagchi, and Mathias Payer.

2020. HALucinator: Firmware re-hosting through abstraction layer emulation.

In Proceedings of the 29th USENIX Security Symposium (USENIX ’20).

[11] Lucian Cojocar, Jonas Zaddach, Roel Verdult, Herbert Bos, Aurélien Francillon,

and Davide Balzarotti. 2015. PIE: Parser identification in embedded systems.

[12] Jake Corina, Aravind Machiry, Christopher Salls, Yan Shoshitaishvili, Shuang

Hao, Christopher Kruegel, and Giovanni Vigna. 2017. Difuze: Interface aware

fuzzing for kernel drivers. In Proceedings of the 2017 ACM SIGSAC Conference on

Computer and Communications Security.

[13] Andrei Costin, Jonas Zaddach, Aurélien Francillon, and Davide Balzarotti. 2014.

A Large-Scale Analysis of the Security of Embedded Firmwares. In 23rd USENIX

Security Symposium (USENIX Security 14).

[14] Thomas Dullien and Rolf Rolles. 2005. Graph-based comparison of executable

objects (english version). Journal of Computer Virology and Hacking Techniques

(2005).

[15] Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. 2016. discovRE:

Efficient cross-architecture identification of bugs in binary code. In Network and

Distributed Systems Security (NDSS).

[16] Pwnie Express. 2020. What makes IoT so vulnerable to attack? Technical Report.

Outpost24.

[17] Bo Feng, Alejandro Mera, and Long Lu. 2020. P2IM: Scalable and hardware independent firmware testing via automatic peripheral interface modeling. In

29th {USENIX} Security Symposium ({USENIX} Security 20).

[18] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng

Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings

of the 2016 ACM SIGSAC Conference on Computer and Communications Security.

[19] Fitblip. 2019. Sulley. https://github.com/OpenRCE/sulley.

[20] Eric Gustafson, Marius Muench, Chad Spensky, Nilo Redini, Aravind Machiry,

Yanick Fratantonio, Davide Balzarotti, Aurélien Francillon, Yung Ryn Choe,

Christophe Kruegel, et al. 2019. Toward the analysis of embedded firmware

through automated re-hosting. In 22nd International Symposium on Research in

Attacks, Intrusions and Defenses ({RAID} 2019).

[21] Mingeun Kim, Dongkwan Kim, Eunsoo Kim, Suryeon Kim, Yeongjin Jang, and

Yongdae Kim. 2020. FirmAE: Towards large-scale emulation of IoT firmware for

dynamic analysis. In Annual Computer Security Applications Conference.

[22] Stephan Kleber, Henning Kopp, and Frank Kargl. 2018. NEMESYS: Network

message syntax reverse engineering by analysis of the intrinsic structure of individual messages. In 12th {USENIX} Workshop on Offensive Technologies ({WOOT}18).

[23] Karla Lant. 2017. By 2020, there will be 4 devices for every human on earth.

Futurism (2017).

[24] lcamtuf. 2017. AFL. https://lcamtuf.coredump.cx/afl/.

[25] Trend Micro. 2020. Mirai botnet exploit weaponized to attack IoT devices via

CVE-2020-5902. Technical Report. Security Intelligence Blog.

[26] Trend Micro. 2020. Smart yet flawed: IoT device vulnerabilities explained. Technical

Report. Security News.

[27] Marius Muench, Jan Stijohann, Frank Kargl, Aurélien Francillon, and Davide

Balzarotti. 2018. What you corrupt is not what you crash: Challenges in fuzzing

embedded devices. In NDSS 2018, Network and Distributed Systems Security Symposium.

[28] Lindsey O’Donnell. 2020. More than half of IoT devices vulnerable to severe attacks.

Technical Report. ThreatPost.

[29] Sebastian Österlund, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2020.

ParmeSan: Sanitizer-guided greybox fuzzing. In 29th USENIX Security Symposium

(USENIX Security 20).

[30] Peachtech. 2021. PEACH: The PEACH fuzzer platform. https://www.peach.tech/

products/peach-fuzzer/ Accessed: 2021-01.

[31] Joshua Pereyda. 2017. boofuzz: Network protocol fuzzing for humans. https:

//boofuzz.readthedocs.io/en/stable/.

[32] Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, and Thorsten

Holz. 2015. Cross-architecture bug search in binary executables. In 2015 IEEE

Symposium on Security and Privacy (SP).

[33] Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury. 2020. AFLNET:

A greybox fuzzer for network protocols. In IEEE International Conference on

Software Testing, Verification and Validation (ICST) 2020.

[34] Van-Thuan Pham, Marcel Böhme, Andrew Edward Santosa, Alexandru Razvan

Caciulescu, and Abhik Roychoudhury. 2019. Smart greybox fuzzing. IEEE Transactions on Software Engineering (2019).

[35] Nilo Redini, Aravind Machiry, Ruoyu Wang, Chad Spensky, Andrea Continella,

Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2020. Karonte:

Detecting insecure multi-binary interactions in embedded firmware. In 2020 IEEE

Symposium on Security and Privacy (SP).

[36] Sergej Schumilo, Cornelius Aschermann, Robert Gawlik, Sebastian Schinzel, and

Thorsten Holz. 2017. kAFL: Hardware-assisted feedback fuzzing for OS Kernels.

In 26th USENIX Security Symposium (USENIX Security 17).

[37] Congxi Song, Bo Yu, Xu Zhou, and Qiang Yang. 2019. SPFuzz: a hierarchical

scheduling framework for stateful network protocol fuzzing. IEEE Access (2019).

[38] Prashast Srivastava, Hui Peng, Jiahao Li, Hamed Okhravi, Howard Shrobe, and

Mathias Payer. 2019. FirmFuzz: automated IoT firmware introspection and analysis. In Proceedings of the 2nd International ACM Workshop on Security and Privacy

for the Internet-of-Things.

[39] Liam Tung. 2017. IoT devices will outnumber the world’s population this year for

the first time. Technical Report. ZDNet.

[40] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2017. Skyfire: Data-driven

seed generation for fuzzing. In 2017 IEEE Symposium on Security and Privacy

(SP).

[41] Yanhao Wang, Xiangkun Jia, Yuwei Liu, Kyle Zeng, Tiffany Bao, Dinghao Wu, and

Purui Su. 2020. Not all coverage measurements are equal: Fuzzing by coverage

accounting for input prioritization. In The Network and Distributed System Security

Symposium (NDSS).

[42] Wikipedia. 2021. Edit distance. https://en.wikipedia.org/wiki/Edit_distance.

[43] Wikipedia. 2021. Hierarchical clustering. https://en.wikipedia.org/wiki/

Hierarchical_clustering.

[44] wireghoul. 2019. Doona. https://github.com/wireghoul/doona.

[45] wireshark. 2020. About wireshark. https://www.wireshark.org/about.html.

[46] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017.

Neural network-based graph embedding for cross-platform binary code similarity

detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and

Communications Security.

[47] Bo Yu, Pengfei Wang, Tai Yue, and Yong Tang. 2019. Poster: Fuzzing iot firmware

via multi-stage message generation. In Proceedings of the 2019 ACM SIGSAC

Conference on Computer and Communications Security.

[48] Y. Yu, Z. Chen, S. Gan, and X. Wang. 2020. SGPFuzzer: A state-driven smart

graybox protocol fuzzer for network protocol implementations. IEEE Access

(2020).

[49] Tai Yue, Pengfei Wang, Yong Tang, Enze Wang, Bo Yu, Kai Lu, and Xu Zhou. 2020.

EcoFuzz: Adaptive energy-saving greybox fuzzing as a variant of the adversarial

multi-armed bandit. In 29th USENIX Security Symposium (USENIX Security 20).

[50] Jonas Zaddach, Luca Bruno, Aurelien Francillon, Davide Balzarotti, et al. 2014.

AVATAR: A framework to support dynamic security analysis of embedded sys

tems’ firmwares. In The Network and Distributed System Security Symposium

(NDSS).

[51] Yaowen Zheng, Ali Davanian, Heng Yin, Chengyu Song, Hongsong Zhu, and

Limin Sun. 2019. FIRM-AFL: High-throughput greybox fuzzing of IoT firmware

via augmented process emulation. In 28th USENIX Security Symposium (USENIX

Security 19)