We collected and developed the CCD-INID-V1 dataset at the Center for Cyber Defense, North Carolina A&T State University.
This section discusses the data collection process. In [100], Ullah et al. compare the setup to various datasets. The compared datasets simulate traffic to mimic real-world networks, and their data originate from both physical and virtual devices. Most of these datasets were created in virtual environments, but they are used to provide network security solutions in use cases ranging from smart homes to smart cities.
In [101], the authors provide a secure virtual framework built in a smart home environment. The framework is intended to be applied to all virtual smart use cases, from smart cars to smart factories. Their research collects data in a manner similar to our work: Raspberry Pis equipped with environmental sensors take direct readings, such as temperature and pressure, and upload them to a cloud server via a high-level protocol. The communications use a mixture of protocols: SSL/TLS combined with HTTP, which essentially forms HTTPS.
In a smart home use case, a smart fridge or a smart thermostat, such as Nest, only needs to collect temperature readings and upload them to the cloud server. In a smart lab scenario, real-time temperature and pressure readings are constantly uploaded to the cloud server, and researchers and lab administrators rely on these readings to preserve lab environments. So even though we used Pis, the usage of such a specific device can be generalized. The behavior of the Rainbow HAT resembles that of smart devices that execute one-dimensional jobs. We collected our data in both smart home and smart lab environments. Since the network behavior of most active smart devices can be dissected using NetFlow, which was designed by Cisco, we monitor the flows of these devices and inject real cyberattacks. For feature engineering, we use NFStream, a flow-based feature generation tool.
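To illustrate what flow-based feature generation involves, the following minimal sketch groups packet records by their 5-tuple and derives a few per-flow statistics from header metadata only. It is not NFStream's implementation: the `Packet` record, the `aggregate_flows` helper, and the handful of features are hypothetical, and the sketch treats flows as unidirectional for simplicity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record holding header metadata only (no payload)."""
    ts: float        # capture timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int    # IP protocol number, e.g. 6 for TCP
    size: int        # packet length in bytes


def aggregate_flows(packets):
    """Group packets by unidirectional 5-tuple and compute simple per-flow features."""
    groups = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        groups[key].append(p)

    flows = []
    for key, pkts in groups.items():
        pkts.sort(key=lambda p: p.ts)
        total_bytes = sum(p.size for p in pkts)
        flows.append({
            "src_ip": key[0],
            "dst_ip": key[1],
            "src_port": key[2],
            "dst_port": key[3],
            "protocol": key[4],
            "packets": len(pkts),
            "bytes": total_bytes,
            "duration_s": pkts[-1].ts - pkts[0].ts,
            "mean_packet_size": total_bytes / len(pkts),
            "first_seen": pkts[0].ts,
        })
    return flows


if __name__ == "__main__":
    # Illustrative addresses only; they do not come from the dataset.
    demo = [
        Packet(0.0, "10.0.0.2", "10.0.0.9", 50000, 443, 6, 100),
        Packet(2.0, "10.0.0.2", "10.0.0.9", 50000, 443, 6, 300),
    ]
    for flow in aggregate_flows(demo):
        print(flow)
```

Because every feature here is computed from timestamps, addresses, ports, and sizes, the same approach applies to encrypted traffic, where payloads are not readable.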
As shown in Figure 3, we developed our application in Android Studio, the official integrated development environment (IDE) for the Google-owned Android operating system [102]. The application must initiate the smart sensors to capture environmental data and transmit the data to a cloud-based database, as shown in Figure 4 and Figure 5. The smart sensors reside on a smart board, the Rainbow HAT [103], which is mounted directly on the mini-computer, a Raspberry Pi version 3B [104], running the open-source Android Things operating system [105]. Every 2 s, the sensor board captures the moisture and temperature of the surroundings. A webserver with Wireshark installed listens to the network traffic in and out of the smart devices. The devices are connected to the webserver through Android Debug Bridge (adb). At random time intervals and using multiple source devices, which include both physical and virtual bots, we launched multiple cyberattacks at the target device. Further details about the attacks are described in Section 3.2.2. We used 4 Raspberry Pis and collected data in two smart environments: smart home and smart lab. All web traffic in and out of the smart devices is exchanged over WiFi connections. The raw captured data totals over 50 GB. The raw data is then converted and feature engineered using an open-source library, NFStream [106], which is described in detail in Section 3.2.3. Feature engineering yields 83 features. After labeling and concatenation, we produce the final data file for further experiments.
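The labeling and concatenation step can be pictured with the sketch below. The labeling rule is an assumption made for illustration: a flow is marked as an attack when its source address belongs to a known attacking device and its first packet falls inside a recorded attack window. The function names, the `capture` column, and the generic `attack` label are hypothetical; the actual attack types are those described in Section 3.2.2.

```python
import csv
import io


def label_flows(flows, attack_windows):
    """Return copies of the flows with an added 'label' field.

    attack_windows is a list of (attacker_ip, start_ts, end_ts, attack_name).
    A flow gets attack_name when its source IP matches and its first packet
    falls inside the window; otherwise it is labeled 'normal'.
    """
    labeled = []
    for flow in flows:
        label = "normal"
        for ip, start, end, name in attack_windows:
            if flow["src_ip"] == ip and start <= flow["first_seen"] <= end:
                label = name
                break
        labeled.append({**flow, "label": label})
    return labeled


def concatenate(per_capture_flows):
    """Merge labeled flows from several captures, tagging each row with its capture name."""
    rows = []
    for capture_name, flows in per_capture_flows.items():
        rows.extend({"capture": capture_name, **flow} for flow in flows)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text; column order follows the keys of the first row."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In this sketch each capture (for example, one per environment) is labeled separately and then merged into a single table, which mirrors the idea of producing one final data file.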
Figure 3. Data collection process.
Figure 4. Flow of data in the collection process.
Figure 5. (a) The network architecture of CCD-INID-V1. (b) Flow of data in a typical IoT network architecture.
Sensor readings are encrypted and transmitted through an authenticated channel with random-path-based routing to ensure data privacy. We established the handshake and key exchange using a built-in application programming interface (API) in Android Studio connected to Firebase. We organize data using the rules engine in Firebase to prevent data-injection attacks. The flow of data can be seen in Figure 4.
Based on our security architecture, shown in Figure 5a, we focus mainly on the transmissions between edge devices and cloud servers, where the analysis computing is conducted. At the edge layer, which contains live sensors, data originates from the IoT devices. By communicating through WiFi and adb port forwarding, we not only monitor the data but also generate features at the local server, hence computing at the fog layer. In smart homes and smart labs, WiFi is one of the most widely used short-range transmission protocols, which also include RFID, WLAN, 6LoWPAN, ZigBee, Bluetooth, NFC, and Z-Wave [107]. The sensors have a direct channel to communicate via HTTPS with the cloud server, where the database is located. In this sense, we use a hybrid form of computing at both the fog and cloud layers [47]. To show that our method can identify patterns in traffic despite information hiding, we chose HTTPS over HTTP as the end-to-end communication protocol. We want to see how well our method performs without compromising the privacy of users.
As summarized by [108], long-range (higher-level) transmission protocols include MQTT, CoAP, AMQP, and HTTP(S). In terms of message size, MQTT can hold the least and HTTP(S) the most. Since we are proposing a solution that is applicable in any IoT environment, from smart homes to smart cities, we considered the various long-range protocols. Given the universal usage of HTTP(S), we selected it as our transmission protocol. HTTP(S) is part of the TCP/IP protocol suite. As the most widely used family of transmission protocols in the world, TCP/IP includes HTTP, HTTPS, FTP, and MQTT. HTTPS offers the advantage of transmitting the largest message size along with end-to-end information hiding. With the advancement of technologies such as 5G, we do not necessarily need to reduce message size. Furthermore, we want to show that we can detect anomalies without inspecting what is inside a packet. In other words, we can identify threats while ensuring consumer privacy. Many researchers use TCP/IP protocols to address problems found in IoT use cases [109,110,111,112,113,114].
In [109], Alavi et al. apply MQTT along with TCP/IP to transmit data in their data collection process. In [110], the author uses WiFi and ZigBee to transmit data between devices within a LAN and uses TCP/IP protocols to transmit data between multiple data relays across the internet. Moreover, many smart devices rely on Application Programming Interface (API) services, notably the Representational State Transfer (REST) API, to communicate [111,112,113,114]. REST APIs are mainly built on these standards: HTTP(S), URI, JSON, and XML.
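As a concrete picture of a REST-style exchange, the sketch below builds (but does not send) an HTTPS POST request that carries one sensor reading as JSON. It is an illustration only, not the Android application used in this work: the host `example.invalid`, the `/readings.json` path, the field names, and the `build_reading_request` helper are all hypothetical.

```python
import json
import urllib.request


def build_reading_request(base_url, device_id, temperature_c, pressure_hpa, timestamp):
    """Build (but do not send) an HTTPS POST carrying one sensor reading as JSON."""
    if not base_url.startswith("https://"):
        raise ValueError("readings should only be sent over HTTPS")
    payload = {
        "device": device_id,
        "temperature_c": temperature_c,
        "pressure_hpa": pressure_hpa,
        "timestamp": timestamp,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/readings.json",  # hypothetical resource URI
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_reading_request(
        "https://example.invalid", "pi-1", 22.5, 1013.2, 1700000000
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would transmit it. On the wire, an observer sees only the TLS-protected flow, not the JSON body, which is the situation the flow-based analysis is designed for.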
Although we currently apply our method in smart homes and smart labs, our goal is to extend it to smart campuses, smart cities, smart factories, and smart grids/infrastructures.
Even though we only used 4 Pis, as seen in Figure 6a, the usage of such specific devices can be generalized. The behavior of the Rainbow HAT, as shown in Figure 6b, resembles that of smart devices that execute one-dimensional jobs, such as smart lights, smart thermometers, and smart door locks without cameras.
Figure 6. Photo of Raspberry Pi and Rainbow HAT. (a) Shutdown status; (b) app running.