Malicious Codes and Where to Find Them
tl;dr: Malicious code is everywhere, but it's dangerous and hard to find. Here are some of the ways one might find it, or create it.
⚠️ Disclaimer: Malicious code (MC) analysis is important for cybersecurity, but it requires careful handling. We tend to keep it hidden, but it's out there (unfortunately, we also tend to find it when we least expect it). When performing malicious code analysis, proper security protocols are critical, including isolated virtual machines, detonation zones, and secure test environments, to prevent accidental execution or system compromise.
Introduction
I’ve got a project where I need malicious code to test with. Not to break into systems, but to test software that is supposed to detect malicious code.
BUT…malicious code doesn’t grow on (binary) trees. It’s hard to find online and in GitHub repositories, even though there seems to be a lot of it.
Example: GitHub besieged by millions of malicious repositories in ongoing attack
Options for Finding Malicious Code
- Write it yourself
  - Possibly using YARA rules to reverse engineer test code
- Find it on the “Internet”, e.g.
  - Find open caches of MC
  - Go to the backwaters of the internet
  - Search GitHub for specific known MC code patterns
- Use a large language model (LLM) to generate it
Each of these has its own set of problems.
- You can write MC, but it won’t be as diverse, and it won’t look like real MC.
- Finding MC on the internet is difficult, and sometimes you end up in the various backwaters of the internet, i.e. the pages that don’t show up in search results.
- Most LLMs will refuse to generate MC, or will generate code that is so obviously MC that it is unusable.
Finding it on the Internet
An example: theZoo, which fits well with my monster theme. As I said, it’s hard to find MC on the internet, but there must be collections of it out there, and this is one of them.
theZoo is a project created to make the possibility of malware analysis open and available to the public. Since we have found out that almost all versions of malware are very hard to come by in a way which will allow analysis, we have decided to gather all of them for you in an accessible and safe way. theZoo was born by Yuval tisf Nativ and is now maintained by Shahak Shalev. - theZoo
Searching GitHub
Topics:
Searching for patterns with Sourcegraph:
There is of course much more that could be done here; these are just a few examples.
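The same kind of pattern search can also be run locally over a repository that has already been cloned into an isolated environment. Below is a minimal sketch in Python. The indicator patterns, their names, and the helpers `scan_text` and `scan_tree` are all assumptions made up for illustration; they are not taken from Sourcegraph, GitHub, or any real detection ruleset, and a real search would need far better patterns. The code only reads files as text and never executes them.

```python
import re
from pathlib import Path

# Hypothetical indicator patterns, chosen for illustration only.
# Real rulesets (e.g. YARA rules) are far more precise than these.
SUSPICIOUS_PATTERNS = {
    # eval() applied directly to base64-decoded data (Python or JavaScript style)
    "eval-of-decoded-data": re.compile(
        r"eval\s*\(\s*(?:base64\.b64decode|atob)\s*\("
    ),
    # a download piped straight into a shell
    "pipe-download-to-shell": re.compile(
        r"(?:curl|wget)\b[^\n|]*\|\s*(?:ba)?sh\b"
    ),
    # a very long run of base64-looking characters, often an embedded blob
    "long-base64-literal": re.compile(r"[A-Za-z0-9+/]{200,}={0,2}"),
}


def scan_text(text):
    """Return the sorted names of all patterns that match the given text."""
    return sorted(
        name for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(text)
    )


def scan_tree(root, suffixes=(".py", ".js", ".sh")):
    """Scan source files under root and map each flagged path to its hits.

    Files are only read as text; nothing is imported or executed.
    """
    findings = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits = scan_text(text)
            if hits:
                findings[str(path)] = hits
    return findings
```

Calling `scan_tree("path/to/cloned/repo")` returns a dictionary of flagged files, which is only a list of candidates to review by hand. Patterns this simple will produce plenty of false positives and miss anything even mildly obfuscated.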
Use a Large Language Model (LLM)
LLMs are good at generating code, but they usually have a lot of guardrails and other restrictions that prevent them from generating code that is actually malicious. There are ways to trick them into generating MC, but it’s probably against the terms of service of the LLM vendor, and it’s extra work.
Here’s a simple example with Claude, just to show the standard guardrail response.
However, if we had an LLM without all the guardrails, and perhaps even one trained on malicious code, we would be able to generate more realistic malicious code.
Enter White Rabbit Neo.
WhiteRabbitNeo is a Generative AI Large Language Model (LLM) designed to support DevSecOps professionals in use cases for offensive and defensive cybersecurity, secure infrastructure design and automation, and more…While the foundation AI models (from OpenAI, Anthropic, Google, and Meta) censor their outputs for many security use cases, WhiteRabbitNeo is uncensored and trained to act like a modern adversary. It has a deep knowledge of threat intelligence, software engineering, even infrastructure as code, making it ideally suited for DevSecOps tasks and offloading tedium. - White Rabbit Neo
We can ask it to generate some malicious code.
Well, what it generates isn’t that great, but at least it provides something. One could imagine what nation states and other large, well-resourced organisations are doing to train LLMs to generate more realistic malicious code. With the caches of MC they have, and the resources to train and fine-tune LLMs, they could potentially come up with some interesting MC. (Although, again, code quality and the level of deception and sophistication would still be a problem, even now.)
Conclusion
This is just a brief look at how one might track down some MC/malware to test with.
I find the lack of examples of malicious code really interesting. How do we as an industry get better at stopping, or at least detecting, malicious code without a lot of public examples? Obviously, the reason is that we don’t want to teach attackers how to write malicious code (that said, I think they are still learning). Of course, this problem has existed almost since we invented programming, and nothing is likely to change, which is fine. I’m not suggesting we make it easier for attackers to write malicious code by providing them with more examples, just that “hey, it’s an interesting problem space”.
🤔 One question I have with regard to LLMs: Will LLMs trained on public source code repositories that do NOT contain malicious code be able to produce code that is secure? Can they do so without ever seeing the inverse of secure code?