3.2. Datasets Used

Zhipeng Liu; Niraj Thapa; Addison Shaver; Kaushik Roy; Madhuri Siddula; Xiaohong Yuan; Anna Yu

This method is extracted from research article: Sensors (Basel), Jul 2021

Using Embedded Feature Selection and CNN for Classification on CCD-INID-V1—A New IoT Dataset

DOI: 10.3390/s21144834

The following section discusses the three datasets used for evaluating our models in detail.

We collected and developed the CCD-INID-V1 dataset at the Center for Cyber Defense, North Carolina A&T State University.

This section discusses the data collection process. In [100], Ullah et al. compare the setups of various datasets. The compared datasets simulate traffic to mimic real-world networks, and their data are generated by both physical and virtual devices. Most of these datasets are created in virtual environments, but they are used to provide network security solutions in use cases ranging from smart homes to smart cities.

In [101], the authors present a secure virtual framework that was built in a smart home environment. The framework is designed to be applied to all virtual smart use cases, from smart cars to smart factories. Their research produces data in a manner similar to ours: Pis equipped with environmental sensors collect direct readings, such as temperature and pressure, and upload them to a cloud server via a high-level protocol. The communications use a mixture of protocols: HTTP layered over SSL/TLS, which forms HTTPS.

In a smart home use case, a smart fridge or a smart thermostat, such as Nest, only needs to collect temperature readings and upload them to the cloud server. In a smart lab scenario, real-time temperature and pressure readings are constantly uploaded to the cloud server, and researchers and lab administrators rely on these readings to preserve lab environments. So even though we used Pis, the usage of such a specific device can be generalized. The behavior of the Rainbow HAT resembles that of smart devices that execute one-dimensional jobs. We collected our data in both smart home and smart lab environments. Since the network behavior of most active smart devices can be dissected using NetFlow, which was designed by Cisco, we monitor the NetFlows of these devices and inject real cyberattacks. We apply a feature engineering solution based on NFStream, a flow-based feature generation tool.

As shown in Figure 3, we developed our application in Android Studio, the official integrated development environment (IDE) for the Google-owned Android operating system [102]. The application initiates the smart sensors to capture environmental data and transmits the data to a cloud-based database, as shown in Figure 4 and Figure 5. The smart sensors belong to a smart board, the Rainbow HAT [103], which is mounted directly on the mini-computer, a Raspberry Pi version 3B [104], running the open-source Android Things operating system [105]. Every 2 s, the sensor board captures the moisture and temperature of the surroundings.

A webserver with Wireshark installed listens to the network traffic in and out of the smart devices. The devices are connected to the webserver through Android Debug Bridge (adb). At random time intervals and using multiple source devices, which include both physical and virtual bots, we launched multiple cyberattacks at the target device. Further details about the attacks are described in Section 3.2.2. We used 4 Raspberry Pis and collected data in two smart environments: smart home and smart lab. All web traffic in and out of the smart devices is exchanged over WiFi connections.

The raw captured data totals over 50 GB. The raw data is then converted and feature engineered using an open-source library, NFStream [106], which is described in detail in Section 3.2.3. Feature engineering yields 83 features. After labeling and concatenation, we produce the final data file for further experiments.
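The labeling and concatenation step can be pictured with a minimal Python sketch. It assumes, purely for illustration, that every capture session carries a single label and that the flow records of a session are available as dictionaries; the section does not describe how labels were actually assigned. The function names and the column names are hypothetical.

```python
import csv
import io


def label_and_concatenate(captures):
    """Attach a label to every flow record and merge all sessions.

    captures: list of (label, rows) pairs, where rows is a list of dicts
    holding the flow features of one capture session (assumed layout).
    """
    merged = []
    for label, rows in captures:
        for row in rows:
            # Copy the record so the input is left untouched.
            merged.append({**row, "label": label})
    return merged


def to_csv_text(rows):
    """Serialize the labeled records to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only, not taken from the dataset.
captures = [
    ("normal", [{"bidirectional_packets": 12, "protocol": 6}]),
    ("udp_flood", [{"bidirectional_packets": 9000, "protocol": 17}]),
]
rows = label_and_concatenate(captures)
csv_text = to_csv_text(rows)
```

Labeling per session keeps the sketch simple; labeling by attack time window or by attacker address would be an equally plausible choice.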

Figure 3. Data collection process.

Figure 4. Flow of data in the collection process.

Figure 5. (a) The network architecture of CCD-INID-V1. (b) Flow of data in a typical IoT network architecture.

Sensor readings are encrypted and transmitted through an authenticated channel with random-path-based routing to ensure data privacy. We established the handshake and key exchange using a built-in application programming interface (API) in Android Studio connected to Firebase. We organize the data using the rules engine in Firebase to prevent data-injection attacks. The flow of data can be seen in Figure 4.

Based on our security architecture, shown in Figure 5a, we mainly focus on the transmissions between edge devices and cloud servers, where the analysis computing is conducted. At the edge layer, which contains live sensors, data originates from the IoT things. By communicating through WiFi and adb port forwarding, we not only monitor the data but also engineer features at the local server, hence computing at the fog layer. In smart homes and smart labs, WiFi is one of the most widely used short-range transmission protocols, a group that also includes RFID, WLAN, 6LoWPAN, ZigBee, Bluetooth, NFC, and Z-Wave [107]. The sensors have a direct channel to communicate via HTTPS with the cloud server, where the database is located. In this sense, we use a hybrid form of computing at both the fog and cloud layers [47]. To show that our method can identify patterns in traffic despite information-hiding, we chose HTTPS over HTTP as the end-to-end communication protocol. We want to see how well our method performs without compromising the privacy of users.

As summarized in [108], long-range (higher-level) transmission protocols include MQTT, CoAP, AMQP, and HTTP(S). In terms of message size, MQTT can hold the least and HTTP(S) the most. Since we are proposing a solution that is applicable in any IoT environment, from smart homes to smart cities, we considered the various long-range protocols. Given its universal usage, we selected HTTP(S) as our transmission protocol. HTTP(S) belongs to the TCP/IP protocol suite. As the most widely used protocol suite in the world, TCP/IP carries HTTP, HTTPS, FTP, and MQTT. HTTPS offers the advantage of transmitting the largest message size along with end-to-end information-hiding. With the advancement of technologies such as 5G, we do not necessarily need to reduce message size. Furthermore, we want to show that we can detect anomalies without needing to identify what is inside a packet. In other words, we can identify threats while ensuring consumer privacy. Many authors use TCP/IP protocols to address problems found in IoT use cases [109,110,111,112,113,114].

In [109], Alavi et al. apply MQTT along with TCP/IP to transmit data in their data collection process. In [110], the author uses WiFi and ZigBee to transmit data between devices within a LAN and uses TCP/IP protocols to transmit data between multiple data relays across the internet. Moreover, many smart devices rely on Application Programming Interface (API) services, notably Representational State Transfer (REST) APIs, to communicate [111,112,113,114]. REST APIs are mainly built on these standards: HTTP(S), URI, JSON, and XML.

Although we currently apply our method in smart homes and smart labs, our goal is to extend it to smart campuses, smart cities, smart factories, and smart grids/infrastructures.

Even though we only used 4 Pis, as seen in Figure 6a, the usage of such specific devices can be generalized. The behavior of the Rainbow HAT, as shown in Figure 6b, resembles that of smart devices that execute one-dimensional jobs, such as smart lights, smart thermometers, and smart door locks without cameras.

Figure 6. Photo of Raspberry Pi and Rainbow HAT. (a) Shutdown status; (b) app running.

We selected five frequently used attacks in the creation of our dataset: Address Resolution Protocol (ARP) Poisoning, ARP Denial-of-Service (DoS), UDP Flood, Hydra Bruteforce with Asterisk protocol, and SlowLoris. Table 2 describes each attack in detail. The reasoning behind the selection of each attack is as follows:

ARP Poisoning: ARP Poisoning generates minimal web traffic, so it is extremely challenging for an IDS to pick up the signature of this type of attack. We wanted to see how well our IDS can handle an attack signature with a limited trace.

ARP DoS: This attack leaves plenty of “breadcrumbs” for an IDS to pick up. We sent 600,000 messages to our only available socket at one-second intervals continuously for 12 h.

UDP Flood: This attack is similar to the previous one but uses a different protocol. We wanted to test how our IDS handles network traffic with different protocols.

Hydra Bruteforce with Asterisk protocol: This type of attack attempts to gain authentication using commonly used password combinations. Hydra is a well-known attack toolkit. The Asterisk protocol is an interesting choice for our attack selection because it is a standard protocol for voice-over-IP, which is relevant to the many users who relied on communication tools such as Zoom, Skype, WeChat, and WhatsApp during the COVID-19 pandemic.

SlowLoris: SlowLoris is a newer form of low-bandwidth Distributed Denial-of-Service attack [115]. First developed by a hacker named Robert “RSnake” Hansen, this attack can bring down high-bandwidth servers with a single botnet computer, as evidenced in the 2009 Iranian presidential election [116].

Table 2. Attacks on CCD-INID-V1 Dataset.

For our dataset, we used NFStream to engineer the features. NFStream is an open-source Python library that provides flexible and quick feature conversion to make live or offline network data more intuitive. Its designers have the broader goal of making the library a common network data analytics framework for researchers, providing data reproducibility across experiments and hence standardization. NFStream offers the following benefits:

Statistical feature extraction: NFStream provides post-mortem statistical features (e.g., min, mean, stddev, and max of packet size and inter-arrival time) and early flow features (e.g., the sequence of the first n packet sizes, inter-arrival times, and directions).

Flexibility: NFStream is easily extensible. The project is open source, and NFPlugins can be used for feature engineering.
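To make the two feature families concrete, the following standard-library Python sketch computes them for a single flow. It is not NFStream code: the function names, the packet tuple layout, the millisecond unit, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return min, mean, stddev, and max of a non-empty sequence."""
    return {
        "min": min(values),
        "mean": mean(values),
        "stddev": pstdev(values),  # population stddev (assumed convention)
        "max": max(values),
    }


def flow_features(packets, n_first=3):
    """Compute post-mortem and early flow features for one flow.

    packets: time-ordered list of (timestamp_s, size_bytes, direction),
    with direction 0 for source-to-destination and 1 for the reverse
    (hypothetical layout).
    """
    times = [p[0] for p in packets]
    sizes = [p[1] for p in packets]
    # Inter-arrival time: gap between consecutive packets, in milliseconds.
    piat_ms = [(b - a) * 1000.0 for a, b in zip(times, times[1:])]
    return {
        # Post-mortem statistics, available once the flow has ended.
        "packet_size": summarize(sizes),
        "piat_ms": summarize(piat_ms) if piat_ms else None,
        # Early flow features: the first n sizes, gaps, and directions.
        "first_sizes": sizes[:n_first],
        "first_piat_ms": piat_ms[:n_first],
        "first_directions": [p[2] for p in packets[:n_first]],
    }


# Illustrative packets only, not taken from the dataset.
example = [(0.0, 60, 0), (0.25, 1500, 1), (0.75, 60, 0), (1.0, 100, 0)]
features = flow_features(example)
```

The post-mortem statistics need the whole flow, whereas the early features are fixed after the first n packets, which is why the latter suit classification before a flow terminates.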

NFStream is built upon the concept of flow-based aggregation. Packets are aggregated into flows based on shared properties that form the flow key, such as transport protocol, VLAN identifier, and source and destination IP addresses. From a flow's entry until its termination (e.g., by active timeout or inactive timeout), a flow cache keeps track of it. If the entry is present in the flow cache, its counters and several other metrics are updated as packets arrive. If packets are seen in both directions, the flow cache applies a bidirectional flow definition, which adds counters and metrics for both directions.
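The aggregation logic described above can be sketched in plain Python as follows. This is a simplified illustration rather than NFStream's implementation: the packet dictionary layout, the `Flow` fields, and the two timeout values are assumptions, and expiry is checked only when a new packet with the same key arrives, whereas a real meter would also scan the cache for stale entries.

```python
from dataclasses import dataclass

IDLE_TIMEOUT = 120.0     # seconds without packets before expiry (assumed value)
ACTIVE_TIMEOUT = 1800.0  # maximum flow lifetime in seconds (assumed value)


@dataclass
class Flow:
    key: tuple
    src_ip: str  # initiator: source of the first packet seen
    first_seen: float
    last_seen: float
    src2dst_packets: int = 0
    dst2src_packets: int = 0
    bidirectional_bytes: int = 0


def flow_key(pkt):
    """Direction-independent key so both directions map to one flow."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (pkt["protocol"], pkt.get("vlan_id", 0)) + tuple(sorted((a, b)))


def aggregate(packets):
    """Group time-ordered packets into flows, expiring entries on timeout."""
    cache, finished = {}, []
    for pkt in packets:
        key = flow_key(pkt)
        flow = cache.get(key)
        if flow is not None:
            idle = pkt["ts"] - flow.last_seen > IDLE_TIMEOUT
            too_long = pkt["ts"] - flow.first_seen > ACTIVE_TIMEOUT
            if idle or too_long:
                # Terminate the old flow; this packet starts a new one.
                finished.append(cache.pop(key))
                flow = None
        if flow is None:
            flow = cache[key] = Flow(key, pkt["src_ip"], pkt["ts"], pkt["ts"])
        # Update the cached entry's counters for the matching direction.
        flow.last_seen = pkt["ts"]
        flow.bidirectional_bytes += pkt["size"]
        if pkt["src_ip"] == flow.src_ip:
            flow.src2dst_packets += 1
        else:
            flow.dst2src_packets += 1
    finished.extend(cache.values())
    return finished
```

Sorting the two endpoints inside the key is one simple way to obtain a bidirectional flow definition; the direction of each packet is then recovered by comparing its source with the flow's initiator.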

NFStream's overall architecture can be summarized as follows:

NFStreamer is a driver process. The driver’s main responsibility involves setting the overall workflow, which is mostly an orchestration of parallel metering processes.

Meters are integral parts of the NFStream framework. After processing (e.g., timestamping, decoding, truncation), raw packets are dispatched across the meters. Meters transform the information gathered through flow aggregation into statistical features until the flow is terminated by expiration (active timeout, inactive timeout).

After being processed by the meters, a flow becomes an NFlow, the term used in NFStream. New flow features are engineered based on the configuration set by NFStreamer. In Table 3, we list the features that are extracted.

Table 3. Features generated for CCD-INID-V1 dataset [106].

The dataset contains 83 features, including source and destination string representation of IP and MAC addresses, flow bidirectional packets accumulator, and multiple timestamps.

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
