Building Your Own Knowledgebase
I’ve always wanted a search engine for things like the VX Underground papers archive. It’s an amazing resource but there isn’t a great way to search through it all.
With recent developments in AI, Large Language Models (LLMs) and retrieval systems offer some genuinely practical tools for researchers. We’ll explore how to build something useful: a private, customizable knowledge base that works the way you need it to work.
Working with Local LLMs
LLMs represent an interesting addition to a security researcher’s toolkit. While they’re not a silver bullet, they can be particularly effective for specific tasks when used thoughtfully.
They excel at tasks like code analysis, pattern identification, and converting unstructured data into structured formats like JSON.
LLMs are easy to work with in Python. With a few lines of code, you can have an LLM chatting back and forth with your code:
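For example, a bare-bones chat loop with the ollama Python package might look something like this (the model name is just a placeholder for whatever you have pulled locally):

```python
# pip install ollama -- a thin client for a locally running Ollama server
import ollama

history = []
while True:
    user_input = input("> ")
    history.append({"role": "user", "content": user_input})

    # Send the full history each time so the model keeps the conversation's context
    response = ollama.chat(model="gemma2:2b", messages=history)
    reply = response["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    print(reply)
```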
Understanding Retrieval Augmented Generation
Language models like ChatGPT are trained on vast amounts of public data. However, their training data eventually becomes outdated, and they lack access to specialized or private information. This is where open-source, local language models become valuable – they provide similar capabilities while giving you control over your data, without API limits or additional costs.
A huge community has grown around this idea of open-source AI; my favourite communities are Hugging Face and Reddit’s /r/LocalLLaMA. This approach embodies the open-source community’s core values of transparency and collaborative development.
These models do have limitations. Training cutoff dates mean they might miss recent developments – for instance, Anthropic’s Claude has a training cutoff around March 2024. They also might not include specialized documentation or private research. This is where Retrieval Augmented Generation (RAG) becomes useful.
Consider working with threat intelligence reports. Traditional keyword searches might help locate relevant sections, but RAG can help transform reports like this one into structured data:
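For example, you can ask the model to pull the key facts out of a report and return them as JSON. The values below are made up purely for illustration, but they show the shape of output you can ask for:

```json
{
  "threat_actor": "ExampleGroup",
  "malware_families": ["ExampleLoader"],
  "techniques": ["T1566.001 Spearphishing Attachment", "T1059.001 PowerShell"],
  "iocs": {
    "domains": ["updates.example[.]invalid"],
    "sha256": ["<hash placeholder>"]
  }
}
```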
RAG can also assist in creating initial detection rules. While these might need refinement, they provide a useful starting point:
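As an illustration (entirely made up, not taken from any real report), a first pass might be a loose YARA rule that a human then tightens up:

```
rule Example_Loader_Strings
{
    meta:
        description = "Hypothetical first-pass rule for illustration only"
    strings:
        $api1 = "VirtualAllocEx" ascii
        $api2 = "WriteProcessMemory" ascii
        $cfg  = "example-c2-config" ascii
    condition:
        uint16(0) == 0x5A4D and 2 of them
}
```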
At its core, RAG enhances your language model by providing it with contextually relevant information from your document collection. When you ask a question, RAG searches a vector database, combines relevant information with your query, and helps the model provide more informed responses.
Building a RAG
While there are pre-built solutions available, building a simple RAG system helps understand the underlying concepts. Let’s start with setting up a local model using Ollama, which works across Mac, Windows and Linux. I’m on Ubuntu 24.04:
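At the time of writing, the Linux install is a one-liner (check ollama.com if the instructions have changed):

```
curl -fsSL https://ollama.com/install.sh | sh
```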
Then pull down a small model and test it:
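The 2-billion-parameter Gemma2 is a good small starting point, and it’s the model we’ll use throughout:

```
ollama pull gemma2:2b
ollama run gemma2:2b
```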
You’ll get a basic chat prompt in your CLI. You can ask questions, and the model has a memory, called a context window, of 2048 tokens (roughly 8,000 characters, give or take).
Let’s ask vanilla Gemma2 to make sure it doesn’t know the answer to the question we’re going to ask:
Gemma2 gives a perfectly reasonable response, but it’s not really helpful for us. You can also ask if it knows what VX Underground is. Most models seemed to have a rough idea, but I haven’t found a model that knows the password.
Next, you’ll want a relatively recent version of Python (I’m on 3.12). To make things much easier to manage, we’ll be using venv and poetry to manage dependencies:
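One way to get there (your exact steps may differ):

```
python3 -m venv ~/.venvs/simple-rag
source ~/.venvs/simple-rag/bin/activate
pip install poetry
```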
To make a new project driven by Poetry:
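A single command scaffolds the project (the name matches the directory described below):

```
poetry new simple_rag
```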
Poetry will create a new directory called simple_rag with some files to get your project started:
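Depending on your Poetry version, the layout looks roughly like this:

```
simple_rag
├── pyproject.toml
├── README.md
├── simple_rag
│   └── __init__.py
└── tests
    └── __init__.py
```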
To handle our RAG, we’ll use Langchain, a library that abstracts away some of the details of building a RAG to make it simple:
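The exact packages depend on which integrations you use; for the sketch below, something along these lines works:

```
poetry add langchain langchain-community langchain-ollama langchain-text-splitters
```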
Here’s the RAG code (simple-rag/simple-rag/simple-rag.py):
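A minimal version, give or take the details, looks like this (the embedding model, chunk sizes, and the exact question are just reasonable defaults to start from):

```python
# simple-rag/simple-rag/simple-rag.py
# A minimal sketch -- adjust the embedding model, chunk sizes and question to taste.
from langchain_community.document_loaders import TextLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the test document and split it into overlapping chunks
docs = TextLoader("test.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed the chunks into an in-memory vector store
# (requires: ollama pull nomic-embed-text)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = InMemoryVectorStore.from_documents(chunks, embeddings)

# Pull back the chunks most relevant to the question
question = "What is the password for the VX Underground papers archive?"
relevant = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in relevant)

# Hand the retrieved context plus the question to the local model
llm = ChatOllama(model="gemma2:2b")
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(llm.invoke(prompt).content)
```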
Before we run it, let’s make a test document:
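Anything the base model can’t possibly know will do; the “password” below is obviously made up:

```
# the "password" here is fake -- put anything the model can't already know
cat > test.txt <<'EOF'
Note to self: the password for the VX Underground papers archive is swordfish-1234.
EOF
```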
Run it:
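Run it through Poetry so the right virtual environment is used (adjust the path to wherever you saved the script):

```
poetry run python simple-rag/simple-rag/simple-rag.py
```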
There’s something else interesting happening here. It completely ignored the ethical concerns it mentioned earlier. These models are trained to be helpful, safe assistants. It’s possible it doesn’t know the answer, and it tries to steer the conversation away to something helpful, or it’s possible that putting the model in the context of analyzing a document broke it out of its safety training.
There are jailbreaks that use similar techniques: persuade a model to imagine a fictitious scenario in which it writes some malicious code, and it will often play along and write the malicious code. Initially, I used jailbreaks when building RAG for cybersecurity documents, but I realized they were often not needed to get back the information I needed (gemma, qwen2.5, mistral-nemo).
Now that we’ve got a working example, we need to talk about how to scale this to more than one document.
Multiple Documents and Context Length
Our example code should work fairly well, but we already have some engineering challenges. We used an in-memory vector store for a quick test, which makes it simpler to get started but won’t scale well. We should build a more permanent vector database that will survive reboots and outages.
We also have the challenge that LLMs have a fixed context length, meaning they can only hold so many tokens at a time before they start to fall apart. You can think of the context window as a constantly shifting memory that the LLM can refer back to as your chat history grows.
If you ask an initial question, the model responds, and then you ask a clarifying question, the entirety of that conversation (user questions + data retrieved from RAG + model responses) has to fit into that window. If the conversation grows too large, the model begins to truncate your earlier chat history.
Every model has an ideal context length, but what you can actually use also depends on how much RAM you have. The default context window for Ollama is 2048 tokens. It also matters how big your model is, measured in number of parameters: Gemma2:2b is a 2 billion parameter model, while Llama 3.1 70B is a 70 billion parameter model, so it requires much more RAM than Gemma2:2b. Working with a smaller model allows you to use larger context windows without running out of RAM.
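If you want to push past the 2048-token default, Ollama lets you raise num_ctx; with the LangChain wrapper from earlier, that’s a one-line change (8192 here is just an example, sized to your RAM):

```python
from langchain_ollama import ChatOllama

# Raise the context window -- larger values cost more RAM
llm = ChatOllama(model="gemma2:2b", num_ctx=8192)
```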
To deal with some of these limitations, another lever we can pull is the size of the documents we retrieve. Instead of getting an entire document back in the response, it would be a lot better if we only got back the passages we really needed.
This ends up being a more involved problem to solve because our vector search needs to have contextual awareness of the source material, and in our case, PDFs can be a nightmare to parse.
In the next section we’ll address some of these issues with a newer retrieval approach called ColBERTv2.
Try It Yourself
Since we talked about jailbreaks, see if you can get one to work with the RAG you’ve now built. Whatever model you decided to use, there’s probably a jailbreak that will work here.