Building Your Own Knowledgebase
I’ve always wanted a search engine for things like the VX Underground papers archive. It’s an amazing resource but there isn’t a great way to search through it all.
With recent developments in AI, Large Language Models (LLMs) and retrieval systems offer some genuinely practical tools for researchers. We’ll explore how to build something useful: a private, customizable knowledge base that works the way you need it to work.
Working with Local LLMs
LLMs represent an interesting addition to a security researcher’s toolkit. While they’re not a silver bullet, they can be particularly effective for specific tasks when used thoughtfully.
They excel at tasks like code analysis, pattern identification, and converting unstructured data into structured formats like JSON.
LLMs are easy to work with in Python. With a few lines of code, you can have an LLM chatting back and forth with your code:
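For example, a bare-bones chat loop with the ollama Python package might look something like this (the model name is just a placeholder for whatever you have pulled locally):

```python
# pip install ollama -- a thin client for a locally running Ollama server
import ollama

history = []
while True:
    user_input = input("> ")
    history.append({"role": "user", "content": user_input})

    # Send the full history each time so the model keeps the conversation's context
    response = ollama.chat(model="gemma2:2b", messages=history)
    reply = response["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    print(reply)
```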
Understanding Retrieval Augmented Generation
Language models like ChatGPT are trained on vast amounts of public data. However, their training data eventually becomes outdated, and they lack access to specialized or private information. This is where open-source, local language models become valuable – they provide similar capabilities while giving you control over your data, without API limits or additional costs.
A huge community has grown around this idea of open-source AI; my favourite communities are Hugging Face and Reddit’s /r/LocalLLaMA. This approach embodies the open-source community’s core values of transparency and collaborative development.
These models do have limitations. Training cutoff dates mean they might miss recent developments – for instance, Anthropic’s Claude has a training cutoff around March 2024. They also might not include specialized documentation or private research. This is where Retrieval Augmented Generation (RAG) becomes useful.
Consider working with threat intelligence reports. Traditional keyword searches might help locate relevant sections, but RAG can help transform reports like this one into structured data:
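For example, you can ask the model to pull the key facts out of a report and return them as JSON. The values below are made up purely for illustration, but they show the shape of output you can ask for:

```json
{
  "threat_actor": "ExampleGroup",
  "malware_families": ["ExampleLoader"],
  "techniques": ["T1566.001 Spearphishing Attachment", "T1059.001 PowerShell"],
  "iocs": {
    "domains": ["updates.example[.]invalid"],
    "sha256": ["<hash placeholder>"]
  }
}
```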
RAG can also assist in creating initial detection rules. While these might need refinement, they provide a useful starting point:
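As an illustration (entirely made up, not taken from any real report), a first pass might be a loose YARA rule that a human then tightens up:

```
rule Example_Loader_Strings
{
    meta:
        description = "Hypothetical first-pass rule for illustration only"
    strings:
        $api1 = "VirtualAllocEx" ascii
        $api2 = "WriteProcessMemory" ascii
        $cfg  = "example-c2-config" ascii
    condition:
        uint16(0) == 0x5A4D and 2 of them
}
```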
At its core, RAG enhances your language model by providing it with contextually relevant information from your document collection. When you ask a question, RAG searches a vector database, combines relevant information with your query, and helps the model provide more informed responses.
Building a RAG
While there are pre-built solutions available, building a simple RAG system helps understand the underlying concepts. Let’s start with setting up a local model using Ollama, which works across Mac, Windows and Linux. I’m on Ubuntu 24.04:
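At the time of writing, the Linux install is a one-liner (check ollama.com if the instructions have changed):

```
curl -fsSL https://ollama.com/install.sh | sh
```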
Then pull down a small model and test it:
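The 2-billion-parameter Gemma2 is a good small starting point, and it’s the model we’ll use throughout:

```
ollama pull gemma2:2b
ollama run gemma2:2b
```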
You’ll get a basic chat prompt in your CLI. You can ask questions, and the model has a memory, called a context window, of 2048 tokens (roughly 8,000 characters, give or take).
Let’s ask vanilla Gemma2 to make sure it doesn’t know the answer to the question we’re going to ask:
Gemma2 gives a perfectly reasonable response, but it’s not really helpful for us. You can also ask if it knows what VX Underground is. Most models seemed to have a rough idea, but I haven’t found a model that knows the password.
Next, you’ll want a relatively recent version of Python (I’m on 3.12). To make things much easier to manage, we’ll be using venv and poetry to manage dependencies:
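One way to get there (your exact steps may differ):

```
python3 -m venv ~/.venvs/simple-rag
source ~/.venvs/simple-rag/bin/activate
pip install poetry
```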
To make a new project driven by Poetry:
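A single command scaffolds the project (the name matches the directory described below):

```
poetry new simple_rag
```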
Poetry will create a new directory called simple_rag with some files to get your project started:
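Depending on your Poetry version, the layout looks roughly like this:

```
simple_rag
├── pyproject.toml
├── README.md
├── simple_rag
│   └── __init__.py
└── tests
    └── __init__.py
```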
To handle our RAG, we’ll use Langchain, a library that abstracts away some of the details of building a RAG to make it simple:
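The exact packages depend on which integrations you use; for the sketch below, something along these lines works:

```
poetry add langchain langchain-community langchain-ollama langchain-text-splitters
```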
Here’s the RAG code (simple-rag/simple-rag/simple-rag.py):
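A minimal version, give or take the details, looks like this (the embedding model, chunk sizes, and the exact question are just reasonable defaults to start from):

```python
# simple-rag/simple-rag/simple-rag.py
# A minimal sketch -- adjust the embedding model, chunk sizes and question to taste.
from langchain_community.document_loaders import TextLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the test document and split it into overlapping chunks
docs = TextLoader("test.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed the chunks into an in-memory vector store
# (requires: ollama pull nomic-embed-text)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = InMemoryVectorStore.from_documents(chunks, embeddings)

# Pull back the chunks most relevant to the question
question = "What is the password for the VX Underground papers archive?"
relevant = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in relevant)

# Hand the retrieved context plus the question to the local model
llm = ChatOllama(model="gemma2:2b")
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(llm.invoke(prompt).content)
```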
Before we run it, let’s make a test document:
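Anything the base model can’t possibly know will do; the “password” below is obviously made up:

```
# the "password" here is fake -- put anything the model can't already know
cat > test.txt <<'EOF'
Note to self: the password for the VX Underground papers archive is swordfish-1234.
EOF
```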
Run it:
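Run it through Poetry so the right virtual environment is used (adjust the path to wherever you saved the script):

```
poetry run python simple-rag/simple-rag/simple-rag.py
```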
There’s something else interesting happening here. It completely ignored the ethical concerns it mentioned earlier. These models are trained to be helpful, safe assistants. It’s possible it doesn’t know the answer, and it tries to steer the conversation away to something helpful, or it’s possible that putting the model in the context of analyzing a document broke it out of its safety training.
There are jailbreaks that use similar techniques: persuade a model to imagine a fictitious scenario in which it writes some malicious code, and it will often play along and write the malicious code. Initially, I used jailbreaks when building RAG for cybersecurity documents, but I realized they were often not needed to get back the information I needed (gemma, qwen2.5, mistral-nemo).
Now that we’ve got a working example, we need to talk about how to scale this to more than one document.
Multiple Documents and Context Length
Our example code should work fairly well, but we already have some engineering challenges. We used an in-memory vector store for a quick test, which makes it simpler to get started but won’t scale well. We should build a more permanent vector database that will survive reboots and outages.
We also have the challenge that LLMs have a fixed context length, meaning they can only hold so many tokens at a time before they start to fall apart. You can think of the context window as a constantly shifting memory that the LLM can refer back to as your chat history grows.
If you ask an initial question, the model responds, and then you ask a clarifying question, the entirety of that conversation (user questions + data retrieved from RAG + model responses) has to fit into that window. If the conversation grows too large, the model begins to truncate your earlier chat history.
Every model has an ideal context length, but what you can actually use also depends on how much RAM you have. The default context window for Ollama is 2048 tokens. It also matters how big your model is, measured in number of parameters: Gemma2:2b is a 2 billion parameter model, while Llama 3.1 70B is a 70 billion parameter model, so it requires much more RAM than Gemma2:2b. Working with a smaller model allows you to use larger context windows without running out of RAM.
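If you want to push past the 2048-token default, Ollama lets you raise num_ctx; with the LangChain wrapper from earlier, that’s a one-line change (8192 here is just an example, sized to your RAM):

```python
from langchain_ollama import ChatOllama

# Raise the context window -- larger values cost more RAM
llm = ChatOllama(model="gemma2:2b", num_ctx=8192)
```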
To deal with some of these limitations, another lever we can pull is the size of the documents we retrieve. Instead of getting an entire document back in the response, it would be a lot better if we only got back the passages we really needed.
This ends up being a more involved problem to solve because our vector search needs to have contextual awareness of the source material, and in our case, PDFs can be a nightmare to parse.
In the next section we’ll address some of these issues with a newer retrieval approach called ColBERTv2.
Try It Yourself
Since we talked about jailbreaks, see if you can get one to work with the RAG you’ve now built. Whatever model you decided to use, there’s probably a jailbreak that will work here.