Implementing HNSW (Hierarchical Navigable Small World) Vector Search in PHP

(centamori.com)

42 points | by centamiv 2 hours ago

3 comments

centamiv 2 hours ago
OP here. I wrote this implementation to deeply understand the mechanics behind HNSW (layers, entry points, neighbor selection) without relying on external libraries. While PHP isn't the typical choice for vector search engines, I found it surprisingly capable for this use case, especially with JIT enabled on PHP 8.x. It serves as a drop-in solution for PHP monoliths that need semantic search features without adding the complexity of a separate service like Qdrant or Pinecone. If you want to jump straight to the code, the open-source repo is here: https://github.com/centamiv/vektor Happy to answer any questions about the implementation details!
[-]
- lukan 17 minutes ago
  Thanks a lot, I liked the fantasy based examples to explain the concept.
  Programming is chanting magic incarnations and spells after all. (And fighting against evil spirits and demons)
- hilti 14 minutes ago
  Great article! I also read your other post and love it! This is exactly my thinking: Locality of Behavior (LoB)
  Never heard this term before, but I like it.
  https://centamori.com/index.php?slug=basics-of-web-developme...
- hu3 1 hour ago
  Great writeup. Thanks for talking the time to organise and share.
  It's tempting to use this in projects that use PHP.
  Is it useable with a corpus of like 1.000 3kb markdown files? And 10.000 files?
  Can I also index PHP files so that searches include function and class names? Perhaps comments?
  How much ram and disk memory we would be talking about?
  And the speed?
  My first goal would to index a PHP project and its documentation so that an LLM agent could perform semantic search using my MCP tool.
  [-]
  - centamiv 52 minutes ago
    I tested it myself with 1k documents (about 1.5M vectors) and performance is solid (a few milliseconds per search). I haven't run more aggressive benchmarks yet.
    Since it only stores the vectors, the actual size of the Markdown document is irrelevant; you just need to handle the embedding and chunking phases carefully (you can use a parser to extract code snippets).
    RAM isn't an issue because I aim for random data access as much as possible. This avoids saturating PHP, since it wasn't exactly built for this kind of workload.
    I'm glad you found the article and repo useful! If you use it and run into any problems, feel free to open an issue on GitHub.
- Random09 48 minutes ago
  The only small thing you forgot to mention - it requires use of AI. Open Ai to be specific. I've got baited.
  [-]
  - centamiv 42 minutes ago
    Apologies if it felt that way! I used OpenAI in the examples just because it's the quickest 'Hello World' for embeddings right now, but the library itself is completely agnostic.
    HNSW is just the indexing algorithm. It doesn't care where the vectors come from. You can generate them using Ollama (locally) HuggingFace, Gemini...
    As long as you feed it an array of floats, it will index it. The dependency on OpenAI is purely in the example code, not in the engine logic.
rvnx 37 minutes ago
Cool blog post, smart guy, very thoughtful and not a copy-paste of Python code like 99% of folks. Nice to see
[-]
- centamiv 30 minutes ago
  Thank you, really appreciate that
fithisux 1 hour ago
It makes perfect sense to implement it in a high level language that allows understandability.
Very good contribution.
[-]
- centamiv 1 hour ago
  Thank you! That was exactly the goal. Modern PHP turned out to be surprisingly expressive for this kind of 'executable pseudocode'. Glad you appreciated it!