In an increasingly confusing world of information, it is becoming more and more important to make your own databases searchable in a targeted manner - not via classic full-text searches, but through semantically relevant answers. This is exactly where the principle of the RAG database comes into play - an AI-supported search solution consisting of two central components:
- a vector database (such as Qdrant) in which any content is stored as numerical vectors,
- and a language model (e.g. via Ollama) that intelligently combines incoming queries with the matching content.
Instead of letting the model "guess", this architecture draws on your own knowledge sources - such as:
- self-written documentation,
- contents of websites,
- technical manuals,
- support databases,
- FAQ lists,
- or any archived text sources (e.g. from old databases).
The decisive factor: All these sources can be prepared in advance and "chunked" (i.e. broken down into small text units) in order to provide the most relevant text excerpts for a user question later on.
So whether you want to make your own knowledge database, internal documentation or an entire product archive analyzable - with Ollama + Qdrant you can do this on your own Mac, without any cloud constraints and with full control over the data.
What is a RAG database - and why "chunking" at all?
RAG stands for Retrieval-Augmented Generation - in other words: text-generating AI with assisted information retrieval. Instead of training a language model such as GPT, Mistral or LLaMA only on what it already "knows", it can access additional, proprietary information via a connected knowledge database (usually a so-called vector database).
Example:
If you ask a language model: "What is in my 2023 tax return?", it will have to guess without access to the original data. However, if it has access to a locally stored, vector-based representation of this document, it can retrieve the relevant information and incorporate it into its answer.
Why content is "chunked"
Documents, websites or books are usually far too long to be processed or searched in one go. Modern language models also have token limits - i.e. a limited length of text that they can understand at once (often around 4,000-8,000 tokens, with newer models even 32,000 or more).
That is why RAG uses the following trick:
- The original text is divided into small sections (chunks).
- Each chunk is converted into a vector by a language model (embedding).
- These vectors are stored in a database such as Qdrant.
- When the user makes a request, the prompt is also translated into a vector - and the most similar chunks are retrieved.
- This content is then added to the language model - e.g. via a system prompt or context injection.
This creates a system that behaves like a memory - not based on classic keywords or full-text search, but purely on meaning (semantic similarity).
Requirements and goal
We are building a local RAG system consisting of:
- a local LLM via Ollama
- a vector database called Qdrant
- a Python script that chunks, vectorizes and inserts texts into the database
- optional: a simple interface or API for querying
Target platform: macOS (Intel or Apple Silicon)
These are the prerequisites:
- macOS 12 or newer (Monterey or higher)
- Basic terminal knowledge
- Python 3.10 or newer
- Optional: Homebrew installed
Step 1: Install Ollama
Ollama is a lean tool that allows you to run local language models such as Mistral, LLaMA, Gemma or Codellama on your own computer - even without the Internet.
Installation on the Mac:
curl -fsSL https://ollama.com/install.sh | sh
Alternatively, Ollama can also be installed via Homebrew:
brew install ollama
After the installation:
ollama run mistral
This downloads the Mistral 7B model and starts it locally. Ollama comes with a REST API, which we will use later for vectorization. You can of course also use other models such as Gemma3 (12B), Mistral Small (24B) or other LLMs.
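To check that the API is responding (default port 11434), you can, for example, request an embedding directly - this is the same endpoint used later for vectorization:
curl http://localhost:11434/api/embeddings -d '{"model": "mistral", "prompt": "Hello world"}'
The response should contain an "embedding" field with a long list of floating-point numbers.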
Step 2: Install Qdrant (local vector database)
Qdrant is a lightning-fast vector database written in Rust. It is free, open source and easy to start on the Mac - preferably via Docker. If you have not yet installed Docker on your Mac, you can download it free of charge from the Docker website and install it as a normal desktop app. Alternatively, if you are already using Homebrew, you can install Docker with it:
brew install --cask docker
Then start Qdrant via Docker:
docker run -p 6333:6333 -v qdrant_storage:/qdrant/storage qdrant/qdrant
Qdrant can then be reached at:
http://localhost:6333
For testing:
curl http://localhost:6333/collections
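On a fresh installation, the response looks roughly like this (an empty list of collections):
{"result": {"collections": []}, "status": "ok", "time": ...}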
Step 3: Prepare the Python environment
We need Python for chunking, embedding and communication with Qdrant.
Preparation:
python3 -m venv rag-env
source rag-env/bin/activate
pip install qdrant-client sentence-transformers ollama numpy
If ollama is not recognized as a Python package, use the REST API directly via requests:
pip install requests
Step 4: Chunking and embedding
Below you will find an example script that splits a text document into chunks, creates embeddings via Ollama and inserts them into Qdrant:
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
import uuid

# Configuration
CHUNK_SIZE = 500  # characters
COLLECTION_NAME = "mein_rag_wissen"

# Prepare the text
with open("mein_text.txt", "r") as f:
    text = f.read()

chunks = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

# Prepare Qdrant
client = QdrantClient("localhost", port=6333)

# Create a new collection (if it does not exist yet)
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=4096, distance=Distance.COSINE)
)

def get_embedding_ollama(text):
    # Request an embedding from the local Ollama API (Mistral: 4096 dimensions)
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "mistral", "prompt": text}
    )
    return response.json()["embedding"]

# Generate embeddings and store them in Qdrant
points = []
for i, chunk in enumerate(chunks):
    vector = get_embedding_ollama(chunk)
    points.append(PointStruct(
        id=str(uuid.uuid4()),
        vector=vector,
        payload={"text": chunk}
    ))

client.upsert(collection_name=COLLECTION_NAME, points=points)
print(f"{len(points)} chunks successfully inserted.")
Step 5: Queries via semantic search
You can now send queries to Qdrant as a vector and have the most relevant text sections found:
query = "Wie funktioniert ein RAG-System?"
query_vector = get_embedding_ollama(query)

results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=3
)

for r in results:
    print(r.payload["text"])
You can then pass these chunks to Ollama - for example via a system prompt - and have them turned into a context-aware answer.
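One possible way to do this - a minimal sketch that reuses the query, the search results and the mistral model from above and assumes Ollama's /api/generate endpoint; the prompt wording is only an illustration:

import requests

def answer_with_context(question, context_chunks, model="mistral"):
    # Join the retrieved chunks into one context block for the prompt
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the following context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    # Non-streaming call to the local Ollama generate endpoint
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False}
    )
    return response.json()["response"]

print(answer_with_context(query, [r.payload["text"] for r in results]))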
Chunking + JSON export to FileMaker and other databases
In many cases, chunking can already take place in an existing database solution - in FileMaker, for example. This is exactly how it works in my own working environment: the source data - such as website content, support entries or technical articles - is already available in structured form in FileMaker.
This is how the process works:
- The texts are divided into sections of e.g. 300-500 characters within FileMaker, using a custom chunking logic.
- Each chunk is given its own ID and, if applicable, metadata (title, category, source, language, etc.).
- All chunks are automatically exported as JSON files - e.g. to a specific directory on a network drive or directly to the hard disk of the AI server.
- A Python script on the server reads these JSON files and saves them in the Qdrant database.
Example of an exported chunk file (chunk_00017.json)
{
  "id": "00017",
  "text": "Dies ist ein einzelner Textabschnitt mit ca. 400 Zeichen, der aus einer größeren Quelle stammt. Er wurde in FileMaker vorbereitet und enthält alle relevanten Inhalte, die für eine semantische Suche benötigt werden.",
  "metadata": {
    "source": "support_center",
    "category": "Fehlermeldung",
    "language": "de",
    "title": "Drucker wird nicht erkannt"
  }
}
The subsequent import script can then be executed automatically or regularly via the terminal - e.g. via a cron job or manual call:
python3 import_json_chunks.py /Users/markus/Desktop/chunks/
The script reads each JSON chunk, generates the corresponding vector (e.g. via Ollama or SentenceTransformers) and transfers the entry to the Qdrant database.
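The exact script depends on your setup, but a minimal version of import_json_chunks.py could look roughly like this - it assumes the JSON layout shown above, the collection from Step 4 and the Mistral embedding endpoint:

import json
import sys
from pathlib import Path

import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

COLLECTION_NAME = "mein_rag_wissen"  # same collection as in Step 4
client = QdrantClient("localhost", port=6333)

def get_embedding_ollama(text):
    # Same helper as in Step 4: ask the local Ollama API for an embedding
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "mistral", "prompt": text}
    )
    return response.json()["embedding"]

def import_chunks(folder):
    points = []
    for file in sorted(Path(folder).glob("*.json")):
        with open(file, "r", encoding="utf-8") as f:
            chunk = json.load(f)
        points.append(PointStruct(
            id=int(chunk["id"]),  # numeric FileMaker ID; use uuid4() if IDs are not numeric
            vector=get_embedding_ollama(chunk["text"]),
            payload={"text": chunk["text"], **chunk.get("metadata", {})}
        ))
    client.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"{len(points)} chunks imported into '{COLLECTION_NAME}'.")

if __name__ == "__main__":
    import_chunks(sys.argv[1])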
This method is not only transparent, but also combines very well with existing IT structures - especially in companies that already use FileMaker or prefer to manage everything centrally and visually for the sake of process clarity.
Connect any databases to your local AI
With Ollama and Qdrant, a complete, high-performance RAG system can be set up on the Mac in a short time:
- Local, without cloud or subscription
- Expandable, with your own content
- Data secure, as nothing leaves the computer
- Efficient, as Qdrant remains fast even on large amounts of data
If you want to use your AI not just for chatting, but as a real knowledge and memory system, this combination is a must. And it works with little effort - with full control over your own data.
Outlook: What is possible with RAG, Ollama and Qdrant
The setup described in this article forms the technical basis for a new way of dealing with knowledge - local, controlled and flexibly expandable. But the journey by no means ends there. Once you have understood the interplay of chunking, embedding, semantic search and language models, you will quickly realize how versatile this architecture is in practice.
1. Connection to your own databases
Whether FileMaker, MySQL, PostgreSQL or MongoDB - any content can be regularly extracted, chunked and automatically inserted into the vector database using targeted queries. This turns a classic database into a semantically searchable source of knowledge. Especially in support systems, product archives or digital libraries, this opens up completely new access options for employees or customers.
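As a rough illustration (using Python's built-in sqlite3 module as a stand-in - the table and column names are invented, and for MySQL or PostgreSQL only the driver changes), the extraction step could look like this:

import sqlite3

CHUNK_SIZE = 500

# Hypothetical example database with an "articles" table (id, title, body)
conn = sqlite3.connect("knowledge.db")
rows = conn.execute("SELECT id, title, body FROM articles").fetchall()

chunks = []
for article_id, title, body in rows:
    # Simple character-based chunking, as in Step 4
    for i in range(0, len(body), CHUNK_SIZE):
        chunks.append({
            "text": body[i:i + CHUNK_SIZE],
            "metadata": {"source_id": article_id, "title": title}
        })

print(f"{len(chunks)} chunks prepared for embedding and import into Qdrant.")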
2. Automatic import of web pages, PDFs or documents
Content does not have to be transferred manually. With tools such as BeautifulSoup, readability, pdfplumber or docx2txt, entire websites, PDF manuals or Word documents can be automatically imported, converted into text form and prepared for chunking. For example, technical wikis, customer portals or online documentation can be regularly updated and fed into the RAG database.
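A minimal sketch for the web case, assuming the requests and BeautifulSoup libraries (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

def page_to_text(url):
    # Download the page and strip scripts, styles and navigation elements
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

text = page_to_text("https://example.com/docs")  # placeholder URL
chunks = [text[i:i + 500] for i in range(0, len(text), 500)]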
3. Long-term knowledge building through structuring
In contrast to a classic AI application, which starts from scratch with every question, a RAG setup allows the step-by-step expansion and curation of the underlying knowledge. The targeted selection and preparation of chunks creates a semantic memory of its own, which becomes more valuable with each entry.
4. Connection with knowledge graphs (Neo4j)
If you want to go one step further, you can not only store the information semantically, but also link it logically. With Neo4j, a graph database, relationships between terms, people, topics or categories can be visualized and specifically queried. This turns a collection of texts into a structured knowledge graph that can be used by both humans and AI - e.g. to visualize causal chains, temporal sequences or thematic clusters.
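As a rough sketch of the idea, using the official neo4j Python driver - the connection details, node labels and relationship type are made up for illustration:

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_chunk_to_topic(tx, chunk_id, topic):
    # Create (or reuse) a chunk node and a topic node and connect them
    tx.run(
        "MERGE (c:Chunk {id: $chunk_id}) "
        "MERGE (t:Topic {name: $topic}) "
        "MERGE (c)-[:ABOUT]->(t)",
        chunk_id=chunk_id, topic=topic
    )

with driver.session() as session:
    session.execute_write(link_chunk_to_topic, "00017", "Drucker")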
5. Use in your own tools, apps or chatbots
Once set up, the RAG logic can be integrated into almost any application: as a semantic search function in an internal web app, as an intelligent input aid in a CRM system or as a chatbot with its own expertise on the company website. By using local APIs (e.g. Ollama REST and Qdrant gRPC), all components remain flexible and expandable - even beyond traditional company boundaries.
Those who have the courage to familiarize themselves with these tools create the basis for independent, local AI systems with real utility value - in the spirit of control, sovereignty and technical clarity.
Frequently asked questions about RAG with Ollama + Qdrant
1. What is a RAG database - and what is it good for?
A RAG database (Retrieval-Augmented Generation) combines a vector database with a language model. It makes your own content - e.g. documentation or websites - semantically searchable, so that AI models can access the relevant sections of your own data directly.
2. What does "chunking" mean in this context?
Chunking means breaking down long texts into smaller, meaningfully coherent sections (chunks) - usually between 200 and 500 characters. This allows individual text sections to be saved efficiently in the vector database and retrieved later when questions arise.
3. Why can't you simply store entire texts in Qdrant?
Because AI models and vector searches work with limited text lengths. In very large documents, important details get "buried" and retrieval becomes imprecise. Chunking increases accuracy because specific sections are compared instead of complete texts.
4. Can I use content from any source?
Yes, as long as you have the texts in an editable form (e.g. as plain text, HTML, Markdown, PDF, FileMaker entries, etc.), you can prepare them, chunk them and integrate them into Qdrant. Mixed sources are also possible.
5. Do I have to be able to program to build such a system?
Basic knowledge of the terminal and Python is helpful, but not essential. Many steps (e.g. chunking in FileMaker, JSON export) can be implemented visually and automatically. The Qdrant import script can be easily customized.
6. Can I also manage several documents or categories?
Yes, each chunk can contain metadata - e.g. title, source, language or category. These can be taken into account during the search in order to filter results more specifically.
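With Qdrant this is done via payload filters - a minimal sketch that reuses the query vector from Step 5 and assumes the metadata fields (here "category", as in the JSON example above) were stored in the chunk payload:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient("localhost", port=6333)

# query_vector is created exactly as in Step 5 (embedding of the user question)
results = client.search(
    collection_name="mein_rag_wissen",
    query_vector=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="Fehlermeldung"))]
    ),
    limit=3
)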
7. Which models are suitable for embedding generation?
You can either use a local model via Ollama (e.g. mistral, llama2, gemma) or a dedicated embedding model such as all-MiniLM from sentence-transformers. The important thing is that the model outputs embeddings as fixed-length vectors.
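A minimal sketch with sentence-transformers - note that the Qdrant collection must then be created with the matching vector size (all-MiniLM-L6-v2 produces 384-dimensional vectors, not the 4096 of Mistral):

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 returns 384-dimensional vectors, so the collection
# in Qdrant must be created with size=384 instead of 4096
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Wie funktioniert ein RAG-System?").tolist()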
8. How do I start Qdrant on the Mac?
The easiest way is via a Docker command:
docker run -p 6333:6333 -v qdrant_storage:/qdrant/storage qdrant/qdrant
Qdrant is then available at http://localhost:6333
9. How large can my data volumes be?
Qdrant is very performant and can easily manage tens or hundreds of thousands of chunks. The main limitation is RAM and storage space, not the number of entries.
10. Does this also work with FileMaker?
Yes, you can do all the chunking and JSON export directly in FileMaker. The chunks are exported as individual JSON files, which are then imported into Qdrant via a Python script - completely independently of the original system.
11. Can I also run this on another server instead of a Mac?
Absolutely. The setup also works on Linux servers, Raspberry Pi, or in the cloud (if desired). Docker makes it platform-independent. For productive use, a server with more RAM and GPU support is usually recommended.
12. How do I combine the vector search with Ollama?
You first create a vector for the user's question via Ollama (embedding API), use it to find the most relevant chunks in Qdrant, and pass these to the language model as context. Ollama then processes the question together with the retrieved context and generates a well-founded answer.
Image (c) geralt @ pixabay





