Towards Using NoSQL for Honeypots
While one might argue that the development of, and interest in, honeypots have decreased, there seems to be new interest in getting these systems to work at scale and in a more automated manner. This study attempts to deliver a higher-level overview of the data generated over a period of time by a medium-interaction honeypot running cowrie, a free and open-source honeypot that is currently very popular and regularly updated. The research is based on a dataset consisting of approximately 7.7 million events spanning almost four months. Some rudimentary tools were developed to transfer the text-based log files into a MongoDB database and to allow simple, performant querying of the log data stored there. Both quantitative and qualitative measurements were considered for this task. The research concludes with suggestions for how this project can be taken further and how the analysed data can indicate which areas should be researched further in separate studies.
With the steady move of older infrastructure to the web comes a different set of problems that have to be considered. Documents that used to be locked in a cabinet to which only a few managers had the key can now potentially be accessed by anyone if default credentials are used to “secure” the system. To assess the threat landscape, honeypots with varying degrees of interaction may be used to gather information about attacks originating from the internet. This way, one may understand more of what one is up against and possibly even stop attacks before they do much harm, with the help of precursor attack analysis (Ahmad, 2015). The problem is that the amount of data generated by such systems can be overwhelming. Hence, we need systems better suited to the analysis of data at this scale. This study attempts to demonstrate the importance of efficient data handling through database systems, in this case MongoDB.
Prior to any analysis of the data, considerable effort was put into planning the most efficient way to query and manipulate it. SQLite and PostgreSQL were considered as alternatives to MongoDB. While these technologies could have been used, as the data is relatively structured, MongoDB was preferred for the speed and ease with which documents (MongoDB's term for individual records) can be inserted and retrieved. ACID compliance was not of high importance either (Hammes, 2014).
Once the technology had been chosen, the range of the dataset could be determined, as its feasible size depended on how performant that technology was. Because the initial experiments with performant queries were successful, the entire cowrie dataset could be analysed. The full dataset was 8.1 GB (7.7 million documents), with a date range from 30-03-2019 to 21-07-2019. It should be noted that the data from the honeypot is not entirely continuous: some days between the start and end of the recording are missing, mostly in 07-2019.
Nevertheless, the size of the dataset meant that compromises had to be made, as it was clear that some of the planned analysis would not be feasible. An analysis of all reported binaries would be beneficial, but it was deemed unfeasible due to the size of the dataset and the many limiting factors (such as low permitted requests per second) of potential APIs (like VirusTotal). As most of the documents already contained geolocalisation data, effort was put into structuring this data and presenting a higher-level perspective. Some tasks nonetheless used a GeoLite2 database (MaxMind, 2019) to obtain geolocalisation data.
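The transfer of text-based log files into MongoDB could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tooling used in the study: it assumes cowrie's JSON log output (one JSON object per line, with an ISO-8601 timestamp field), and the function names parse_log_lines and load_into_collection are hypothetical. The collection argument is assumed to behave like a pymongo collection, of which only insert_many is used.

```python
import json
from datetime import datetime, timezone


def parse_log_lines(lines):
    """Yield one dict per valid JSON log line, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or corrupted line
        if not isinstance(event, dict):
            continue
        ts = event.get("timestamp")
        if isinstance(ts, str):
            try:
                # Assumed format: 2019-03-30T12:00:00.123456Z (UTC)
                event["timestamp"] = datetime.strptime(
                    ts, "%Y-%m-%dT%H:%M:%S.%fZ"
                ).replace(tzinfo=timezone.utc)
            except ValueError:
                pass  # keep the original string if the format differs
        yield event


def load_into_collection(collection, events, batch_size=1000):
    """Insert events in batches; return the number of documents sent."""
    batch, total = [], 0
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            collection.insert_many(batch, ordered=False)
            total += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        collection.insert_many(batch, ordered=False)
        total += len(batch)
    return total
```

With pymongo installed, a collection such as MongoClient()["honeypot"]["events"] (placeholder database and collection names) could be passed in together with parse_log_lines applied to an open log file. Batching keeps memory use bounded, which matters for a multi-gigabyte log.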
A graph was plotted showing the number of attempted attacks against the honeypot per unit of time. This graph gives a clear insight into how the attacks cluster over time.
The graph makes it evident that attacks usually happen in bursts, which is as expected, as the majority of attempts (especially automated ones) will most likely occur in one continuous action.
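One way to derive such a time series and to flag bursts is sketched below. It is an illustration only: it assumes that login events (cowrie.login.failed and cowrie.login.success) are what is counted as an attack attempt, that timestamps are already datetime objects, and that a burst is an hour whose count exceeds a multiple of the mean. The threshold factor and both function names are hypothetical choices, not taken from the study.

```python
from collections import Counter
from datetime import datetime, timezone

LOGIN_EVENTS = {"cowrie.login.failed", "cowrie.login.success"}


def attempts_per_hour(events):
    """Count login attempts per UTC hour bucket."""
    counts = Counter()
    for event in events:
        if event.get("eventid") not in LOGIN_EVENTS:
            continue
        ts = event.get("timestamp")
        if not isinstance(ts, datetime):
            continue
        if ts.tzinfo is None:
            # Assumption: naive timestamps are already in UTC
            ts = ts.replace(tzinfo=timezone.utc)
        bucket = ts.astimezone(timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        counts[bucket] += 1
    return counts


def burst_hours(counts, factor=3.0):
    """Return hour buckets whose count exceeds factor times the mean.

    The mean is taken over hours with at least one attempt; hours with
    no recorded events are not represented in counts.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(hour for hour, n in counts.items() if n > factor * mean)
```

The per-hour counts can be plotted directly, and the list returned by burst_hours marks the hours in which the bursts occur.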
Username            No. of occurrences
- root              236070
- admin             15654
- enable\x00        9403
- shell\x00         9331
- (empty string)    3937
- default           2804
- support           1800
- user              1344
- guest             1064
- pi                852
Password            No. of occurrences
- admin             163971
- system\x00        9451
- sh\x00            9091
- 7ujMko0admin      7830
- xc3511            6556
- default           6391
- 12345             6251
- vizxv             5553
- password          5294
- root              4745
Username and password
username:password         No. of occurrences
- root:admin              161519
- enable\x00:system\x00   9359
- shell\x00:sh\x00        9091
- root:7ujMko0admin       7166
- root:xc3511             6490
- root:vizxv              5488
- root:default            5407
- root:12345              4874
- root:password           4067
- root:hunt5759           3958
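Frequency tables like the three above map naturally onto a MongoDB aggregation pipeline. The sketch below only builds the pipeline. It assumes that login events carry username and password fields under cowrie's login event identifiers, and the helper name top_values_pipeline is hypothetical. It is not the query used in the study.

```python
LOGIN_EVENTS = ["cowrie.login.failed", "cowrie.login.success"]


def top_values_pipeline(fields, limit=10):
    """Build an aggregation pipeline that counts the most common values
    of the given fields, e.g. ["username"] or ["username", "password"]."""
    return [
        # Keep only login attempts
        {"$match": {"eventid": {"$in": LOGIN_EVENTS}}},
        # Group on the combination of the requested fields and count
        {"$group": {
            "_id": {field: f"${field}" for field in fields},
            "count": {"$sum": 1},
        }},
        # Most frequent first, then keep the top entries
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]
```

Passing the result to the aggregate method of a pymongo collection, for example with fields set to ["username", "password"], would return the ten most frequent credential pairs with their counts. Grouping inside the database avoids pulling millions of documents into the client.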
While, as stated earlier, most documents contained geolocalisation data, some analysis was done using the GeoLite2 database (MaxMind, 2019) through a script written by Raidan Campbell (Campbell, 2019), but modified to work with MongoDB instead of CSV files.
Graph showing number of attack attempts relative to country.
Graph showing where bad IPs come from.
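The per-country counts behind such graphs could be produced along the following lines. This is a sketch under assumptions: events are taken to carry a src_ip field, and lookup stands for any function that maps an IP address to a country code. A thin wrapper around a GeoLite2 database reader would be one option, but it is not shown here. The function name connections_by_country is hypothetical.

```python
from collections import Counter


def connections_by_country(events, lookup):
    """Count events per country.

    lookup is any callable mapping an IP string to a country code; it
    may return None or raise KeyError when the address is unknown.
    """
    counts = Counter()
    for event in events:
        ip = event.get("src_ip")
        if not ip:
            continue  # event without a source address
        try:
            country = lookup(ip)
        except KeyError:
            country = None
        counts[country or "unknown"] += 1
    return counts
```

Calling most_common(10) on the returned counter gives the ranking that a bar chart per country would display.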
Default usernames and passwords are an old problem (Howard, 2005), which one might think would be gone by now. Alas, this is not the case (Ling, 2017), and their rampant use in internet of things (IoT) devices is part of an old habit that needs to end for security’s sake. While many usernames and passwords are trivial to guess, such as “root”, “pi” and “admin”, it is less obvious why others, like “xc3511”, are popular. Many of these credentials also recur in other published work (Hendriks, 2017). A recurring theme is the set of default credentials used by the Mirai botnet (Van der Elzen, 2017). This should come as no surprise, as default passwords in IoT devices have been exploited heavily in recent times because their use is infamously widespread (Antonakakis, 2017), with Mirai being just one of the entities that uses this “exploit”. With all this considered, it is no wonder that the usage of default credentials is still a matter of research and debate (Knieriem, 2017). It is interesting to note that several usernames and passwords contained null characters (\x00). While there is a range of possible explanations for this, one is a phenomenon described by Anthony Ferrara, in which combining BCrypt with other cryptographic functions can leave an application vulnerable to attacks based on null characters.
According to the figure, many of the attacks originate from a small subset of the IP addresses, with one IP address (220.127.116.11) dominating in the number of attempted attacks. Compared to the total number of connections from this IP address, 2546654, one can see that the number of successful logins is dwarfed at 32718, which amounts to approximately 1.28% of those connections.
Given the number of attacks against the cowrie honeypot, namely against the emulated SSH and telnet services, one might expect the ports of these services to be the most strongly represented, perhaps even in first place. However, port 443 (HTTPS) and port 25 (SMTP) are the first and second most hit ports, respectively, as can be seen in the figure. Even more interesting is that most of the connections to the SMTP service come from a similar range of IP addresses (18.104.22.168/24 or 22.214.171.124/24), with very few connections made from elsewhere. Port 22 (SSH) is, in fact, only the third most hit port. One may reason that this is because these attacks are mostly (if not almost exclusively) automated, with very little human interaction needed.
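Observations about address ranges, such as the /24 networks mentioned above, can be checked by grouping source addresses by their enclosing network. The following is a small sketch using Python's standard ipaddress module. The function name connections_by_subnet is hypothetical, and restricting the grouping to IPv4 is an assumption made for simplicity.

```python
import ipaddress
from collections import Counter


def connections_by_subnet(ips, prefix=24):
    """Count IPv4 source addresses per enclosing network."""
    counts = Counter()
    for ip in ips:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # not a valid IP address
        if addr.version != 4:
            continue  # this sketch only groups IPv4 addresses
        # strict=False masks the host bits instead of raising an error
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        counts[str(net)] += 1
    return counts
```

Applied to the source addresses of all connections to a given port, the most common entries show whether traffic is concentrated in a few networks.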
Some URLs that host binaries were defined as malware by VirusTotal: one in four out of 68 cases (VirusTotal, 2019a), and another in three out of 67 cases (VirusTotal, 2019b). There are other instances of URLs or IP addresses that VirusTotal defines as malicious (VirusTotal, 2019c). As noted earlier, an analysis of all reported binaries would be beneficial but was deemed unfeasible due to the size of the dataset and the limiting factors of potential APIs.
The US, Ireland, Singapore and Russia are seen to lead all other countries in the number of connections that can be attributed to them. The distribution is, however, slightly different when it comes to sources that are deemed malicious by AlienVault from threatfeeds.io (Threatfeeds, 2019), as displayed in the figure. Among potential attackers, the US is still the largest source. However, the countries that follow, namely China, South Korea, Taiwan and Japan, are very different from those in the ranking by total connections. While these are indeed four separate countries, it is interesting that these connections come from different locations in such close geographical proximity to each other.
One of the larger, and frequently updated, datasets available from threatfeeds.io is the “Alienvault IP Reputation” dataset, which contains over 200000 IP addresses, each with a quantitative measurement of how suspicious it is deemed. Using this data, a script was written to determine which countries produced the most malicious activity. In the figure, it can be seen that some countries dominate the statistics, and that those countries are not necessarily the ones that dominate the general activity.
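The cross-referencing step described above might look roughly like this. The sketch rests on several assumptions: the reputation feed has already been loaded into a mapping from IP address to a numeric score, each event carries src_ip and country fields, and a simple minimum score decides what counts as malicious. The function name, the field names and the threshold are hypothetical, not taken from the script used in the study.

```python
from collections import Counter


def flagged_connections_by_country(events, reputation, min_score=1):
    """Count connections per country, restricted to source IPs that the
    reputation feed scores at or above min_score."""
    counts = Counter()
    for event in events:
        score = reputation.get(event.get("src_ip"))
        if score is None or score < min_score:
            continue  # address not listed, or below the threshold
        counts[event.get("country", "unknown")] += 1
    return counts
```

Comparing this counter with the per-country totals for all connections shows where the two rankings diverge.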
It can be seen that a database can aid in the analysis of larger datasets. The performance gains are especially helpful when one wants a higher-level overview of the data without resorting to splitting the data into several files, as more traditional methods using, for example, plain CSV files would dictate. From the data analysis, one can also see that the use of default credentials is still a serious problem, especially as IoT devices are often vulnerable to such attacks, according to research. Some of the analysis may also indicate that these attacks are highly automated and often associated with known malware such as Mirai.
Primarily, improvements to the methods already presented that would yield a lower-level and more complete analysis of the data would be beneficial. To achieve this, however, processing speeds would have to be improved dramatically, and even more performant hardware would be needed. Such performance enhancements would need to scale so that they could also cope with an even larger dataset. Multithreading and/or multiprocessing would be obvious candidates for achieving this.
Ahmad, A., Maynard, S. B., & Shanks, G. (2015). A case analysis of information systems and security incident responses. International Journal of Information Management, 35(6), 717–723.
Antonakakis, M., April, T., Bailey, M., Bernhard, M., Bursztein, E., Cochran, J.,… Kallitsis, M., et al. (2017). Understanding the Mirai botnet. In 26th USENIX Security Symposium (USENIX Security 17) (pp. 1093–1110).
Campbell, R. (2017). Retrieved from https://github.com/raidancampbell/Cowrie-Analyzer
Hammes, D., Medero, H., & Mitchell, H. (2014). Comparison of NoSQL and SQL databases in the cloud. Proceedings of the Southern Association for Information Systems (SAIS), Macon, GA, 21–22.
Hendriks, C. (2017). Fixing the average internet user’s IoT vulnerabilities.
Howard, M., LeBlanc, D., & Viega, J. (2005). 19 deadly sins of software security: Programming flaws and how to fix them.
Knieriem, B., Zhang, X., Levine, P., Breitinger, F., & Baggili, I. (2017). An overview of the usage of default passwords. In International Conference on Digital Forensics and Cyber Crime (pp. 195–203). Springer.
Ling, Z., Luo, J., Xu, Y., Gao, C., Wu, K., & Fu, X. (2017). Security vulnerabilities of internet of things: A case study of the smart plug system. IEEE Internet of Things Journal, 4(6), 1899–1909.
MaxMind. (2019). GeoLite2 free downloadable databases. Retrieved from https://dev.maxmind.com/geoip/geoip2/geolite2/
Threatfeeds. (2019). Free threat intelligence feeds - threatfeeds.io. Retrieved from https://threatfeeds.io/
Van der Elzen, I., & van Heugten, J. (2017). Techniques for detecting compromised iot devices. University of Amsterdam.
VirusTotal. (2019a). Virustotal. Retrieved from LINK
VirusTotal. (2019b). Virustotal. Retrieved from LINK
VirusTotal. (2019c). Virustotal. Retrieved from LINK