Vector Databases: Getting Started

A vector database is a system designed to store high-dimensional vectors and retrieve them through similarity search. These vectors are numerical representations of data such as text, images, audio, and video. In practice, vector databases mostly store embeddings: vectors generated by machine learning models that preserve the semantic characteristics of the data. As a result, semantically similar objects tend to be represented by vectors that lie close to one another in the vector space, so finding related content becomes a matter of finding nearby vectors. Before continuing, I highly recommend reading my post about embeddings, as this concept is fundamental to understanding how vector databases work.

The need for a vector database arises because traditional databases, such as SQL and NoSQL databases, were not designed to perform semantic similarity searches. They excel at storing and querying structured data through exact or lexical matching, such as equality checks, range queries, or keyword searches, but they cannot retrieve information based on its meaning or context. In contrast, vector databases store embeddings and perform similarity searches over these vectors, allowing them to retrieve results that are semantically related to the user’s query, even when different words or expressions are used.
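To make the limitation concrete, here is a minimal sketch of a keyword search, written only for illustration (the tokenize and keyword_search helpers are my own names, not part of any library). It uses five of the sentences and the query that appear later in this post. Since the query and the sentences share no words, a purely lexical match finds nothing, even though several sentences are clearly related in meaning.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Return the documents that share at least one word with the query."""
    query_words = tokenize(query)
    return [doc for doc in documents if query_words & tokenize(doc)]


sentences = [
    "AI learns from data",
    "Robots can recognize images",
    "Models predict future trends",
    "Chatbots answer user questions",
    "Algorithms improve decision making",
]
query = "Machine learning systems discover hidden relationships in information"

matches = keyword_search(query, sentences)
print(matches)  # [] because no sentence shares a word with the query
```

Note that "learning" and "learns" are different tokens here; stemming would help in that one case, but it would still not connect "information" with "data".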

Vector databases are widely used in applications such as Retrieval-Augmented Generation (RAG), embedding-based recommendation systems, semantic search engines, image similarity search, and many other solutions that rely on the efficient retrieval of semantically similar content.

In this post, I will cover the following topics:

Key characteristics of a vector database
How to use ChromaDB
How to use Pinecone
How to use Qdrant
Other widely used vector databases

Key characteristics of a vector database

Although each vector database has its own implementation, most of them share a very similar architecture. Data is typically organized into collections (or equivalent structures, depending on the technology), which are analogous to tables in a relational database. Each collection groups data belonging to the same domain, use case, or dataset, and stores records (also called documents, objects, or points, depending on the database). Each record typically contains a unique identifier, the corresponding embedding, and a set of associated metadata, such as category, author, creation date, or any other attribute relevant to the application.
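The structure described above can be pictured with a few lines of plain Python. This is only a mental model under my own naming (Record and Collection are hypothetical classes, not the API of any vector database), and the three-dimensional embedding is a toy value chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Record:
    """One stored item: identifier, embedding, and free-form metadata."""
    id: str
    embedding: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Collection:
    """A named group of records that share the same embedding dimension."""
    name: str
    dimension: int
    records: dict[str, Record] = field(default_factory=dict)

    def add(self, record: Record) -> None:
        # Every vector in a collection must have the same number of dimensions.
        if len(record.embedding) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(record.embedding)}"
            )
        self.records[record.id] = record


articles = Collection(name="articles", dimension=3)
articles.add(
    Record(
        id="doc-1",
        embedding=[0.1, 0.7, 0.2],
        metadata={"category": "AI", "author": "alice"},
    )
)
print(len(articles.records))  # 1
```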

Beyond this storage structure, vector databases typically provide several features that make similarity search more efficient, accurate, and scalable:

Scalability: As the volume of data grows to millions or even billions of vectors, it becomes necessary to distribute storage and processing across multiple servers. To achieve this, vector databases typically provide mechanisms such as sharding, which partitions data across different cluster nodes, and replication, which maintains multiple copies of the data to improve availability and fault tolerance. These mechanisms distribute the workload and enable low-latency queries, even at massive scale.
Vector indexing: One of the key characteristics of vector databases is the use of specialized indexes to accelerate similarity searches. Instead of comparing the query embedding against every stored vector, the database relies on optimized indexing structures that significantly reduce the search space. In practice, this represents a trade-off between accuracy and performance: a small amount of accuracy (usually measured as recall) is sacrificed in exchange for substantial speed improvements. Rather than guaranteeing the exact nearest neighbor, these algorithms aim to find a sufficiently close result in a fraction of the time required for an exhaustive search. This approach is known as Approximate Nearest Neighbor (ANN) and is widely adopted by modern vector databases. One of the most popular ANN algorithms is Hierarchical Navigable Small World (HNSW), which I will cover in a dedicated post.
Metadata management: In addition to embeddings, vector databases also store metadata associated with each record. This metadata enables deterministic filtering during queries, allowing results to be restricted based on criteria such as category, language, creation date, or author. As a result, similarity search can be combined with structured filters, making retrieval more precise and better aligned with business requirements. I also plan to cover this topic in a dedicated post.
Querying and data retrieval: To perform a search, the user-provided content (text, image, audio, among others) is converted into an embedding using the same model that was used during data indexing. The resulting query vector is then compared against the indexed vectors using a similarity metric such as cosine similarity (the cosine of the angle between two vectors, which ignores their length) or Euclidean distance (the straight-line distance between them). Finally, the database returns the most similar records, along with their associated metadata.
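The querying and metadata filtering steps above can be sketched without any database at all. The code below is an exhaustive (non-ANN) search over a handful of toy two-dimensional vectors; the search function, its where argument, and the record layout are assumptions made for this illustration, not the interface of a specific product. A real vector database would replace the loop over every record with an index such as HNSW.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(records, query_vector, top_k=3, where=None):
    """Exhaustive search: filter by metadata, then score every candidate."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    scored = [(cosine_similarity(r["embedding"], query_vector), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return scored[:top_k]


records = [
    {"id": "a", "embedding": [1.0, 0.0], "metadata": {"topic": "AI"}},
    {"id": "b", "embedding": [0.8, 0.6], "metadata": {"topic": "AI"}},
    {"id": "c", "embedding": [0.0, 1.0], "metadata": {"topic": "Food"}},
]
query_vector = [1.0, 0.1]

# Without a filter, "a" ranks first and "b" second.
for score, record in search(records, query_vector, top_k=2):
    print(record["id"], round(score, 3))

# With a metadata filter, only records whose topic is "Food" are considered.
filtered = search(records, query_vector, top_k=2, where={"topic": "Food"})
print([record["id"] for _, record in filtered])  # ['c']
```

Filtering before scoring keeps the example simple. Real systems have to decide how to combine the filter with the index, which is one of the places where implementations differ.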

Choosing a vector database depends directly on the application’s requirements. For personal projects, prototypes, or smaller applications, simpler solutions are often sufficient. On the other hand, systems that store millions or even billions of vectors and handle a high volume of concurrent queries require solutions that provide efficient indexing, scalability, data distribution, and high availability.
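To give a feel for what data distribution means in practice, the sketch below partitions vectors across three in-memory "shards" and answers a query by asking each shard for its local top results and merging them. It is a simplified illustration under my own assumptions (hash-based placement, dot-product scoring, hypothetical function names) and not a description of how any particular database implements sharding.

```python
import hashlib
import heapq


def shard_for(record_id: str, num_shards: int) -> int:
    """Map an ID to a shard with a stable hash (built-in hash() varies per process)."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_shard(shard, query_vector, top_k):
    """Local top-k within one shard, scored by dot product."""
    scored = [(dot(vector, query_vector), record_id) for record_id, vector in shard.items()]
    return heapq.nlargest(top_k, scored)


def search_all(shards, query_vector, top_k):
    """Scatter the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, query_vector, top_k))
    return heapq.nlargest(top_k, partial)


num_shards = 3
shards = [dict() for _ in range(num_shards)]
vectors = {f"vec-{i}": [float(i), 1.0] for i in range(10)}
for record_id, vector in vectors.items():
    shards[shard_for(record_id, num_shards)][record_id] = vector

top = search_all(shards, query_vector=[1.0, 0.0], top_k=3)
print([record_id for _, record_id in top])  # ['vec-9', 'vec-8', 'vec-7']
```

Because every shard returns its own best top_k candidates, the merged result is the same as searching all vectors in one place, regardless of how the records happened to be distributed.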

How to use ChromaDB

ChromaDB is an open-source vector database known for its ease of installation and use. Designed to help developers build semantic search applications and embedding-based systems with minimal configuration, it is widely used in prototypes, RAG applications, and research projects. Chroma can be deployed in several ways, including in-memory, with local disk persistence, in client-server mode, or through Chroma Cloud. When local persistence is enabled, documents, metadata, and management information are stored in a SQLite database. Additionally, Chroma uses indexes based on the HNSW algorithm to accelerate vector similarity searches.

This flexibility makes Chroma an excellent choice for experimentation, rapid prototyping, MVPs, and personal projects. On the other hand, applications that require greater scalability, high availability, or distributed architectures may require a different infrastructure, whether through Chroma Cloud or other solutions designed for production environments.

The goal of this section is to demonstrate how to use ChromaDB to store embeddings and perform similarity searches. To accomplish this, we will use 15 sentences organized into 3 topics: AI, Food, and Sports. The data is presented below:

data = {
    "AI": [
        "AI learns from data",
        "Robots can recognize images",
        "Models predict future trends",
        "Chatbots answer user questions",
        "Algorithms improve decision making",
    ],
    "Food": [
        "Pizza tastes very good",
        "Rice cooks very quickly",
        "Fresh fruit is healthy",
        "Soup warms cold days",
        "Bread smells really nice",
    ],
    "Sports": [
        "Players train every day",
        "Football requires teamwork",
        "Basketball improves quick reflexes",
        "Running builds strong endurance",
        "Teams celebrate big victories",
    ],
}

The code is shown below:

import chromadb
from uuid import uuid4
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient("./chroma_db")

collection = client.create_collection(
    name="knowledge_base",
    metadata={
        "hnsw:space": "cosine"
    }
)

embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = []
metadatas = []
ids = []

for topic, sentences in data.items():
    for sentence in sentences:
        documents.append(sentence)
        metadatas.append({"topic": topic})
        ids.append(str(uuid4()))

embeddings = embedding_model.encode(documents).tolist()

collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas,
    embeddings=embeddings
)

query = "Machine learning systems discover hidden relationships in information"
query_embedding = embedding_model.encode([query])[0].tolist()
    
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)

print(results["documents"][0])

The data is persisted locally using the PersistentClient, which stores Chroma’s files in the chroma_db directory.
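Since Chroma organizes data into collections, a collection named knowledge_base is created using the create_collection method. The hnsw:space metadata key sets the distance function of the collection’s HNSW index to cosine; without it, Chroma falls back to its default, squared L2 distance. Note that create_collection raises an error if a collection with that name already exists, so when re-running the script against the same chroma_db directory, get_or_create_collection is the safer call.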
The sentence-transformers/all-MiniLM-L6-v2 model is used to convert the sentences into embeddings. This is an open-source model provided by the Sentence Transformers library. Next, a loop iterates over the sentences, generates metadata containing the corresponding topic, and creates unique identifiers for each record using UUID.
After that, the IDs, documents, metadata, and embeddings are added to the collection using the add method.
Finally, a query is executed to retrieve the three sentences that are most similar to the phrase “Machine learning systems discover hidden relationships in information”. To accomplish this, the query is first converted into an embedding using the same model employed during document indexing. The resulting vector is then sent to Chroma, which returns the three closest records in the vector space, as specified by the n_results=3 parameter.
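One detail the walkthrough does not spell out is the [0] in results["documents"][0]. The query method accepts a list of query embeddings, so each field in the result is a list with one inner list per query. The sketch below shows that shape with a hand-written dictionary rather than a live Chroma call: the documents are the three sentences returned in this post, while the IDs are placeholders (the real ones are the UUIDs generated at insert time). The first_query_rows helper is my own name, not part of Chroma.

```python
def first_query_rows(results: dict) -> list[dict]:
    """Flatten the nested lists for the first query into one dict per hit."""
    ids = results["ids"][0]
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    return [
        {"id": record_id, "document": document, "metadata": metadata}
        for record_id, document, metadata in zip(ids, documents, metadatas)
    ]


# Hand-written stand-in shaped like a query result for ONE query embedding.
# The IDs are placeholders; real IDs would be the UUIDs created during insert.
sample_results = {
    "ids": [["id-1", "id-2", "id-3"]],
    "documents": [[
        "AI learns from data",
        "Algorithms improve decision making",
        "Robots can recognize images",
    ]],
    "metadatas": [[{"topic": "AI"}, {"topic": "AI"}, {"topic": "AI"}]],
}

for row in first_query_rows(sample_results):
    print(row["metadata"]["topic"], "|", row["document"])
```

With two query embeddings in the same call, each outer list would contain two inner lists, and index [1] would hold the hits for the second query.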

The query results are shown below, displaying the 3 most similar sentences retrieved by ChromaDB:

"AI learns from data"
"Algorithms improve decision making"
"Robots can recognize images"

If you’re curious to explore the database generated by ChromaDB, you’ll notice that it creates several tables to store its internal data. To make it easier to understand, the SQL query below allows you to view the vectors (stored as BLOBs), the sentences, and their corresponding topics.

SELECT
    t1.id AS embedding_id,
    t1.vector,
    t3.c0 AS sentence,
    t4.string_value AS topic
FROM embeddings_queue t1
INNER JOIN embeddings t2
    ON t1.id = t2.embedding_id
INNER JOIN embedding_fulltext_search_content t3
    ON t2.seq_id = t3.id
INNER JOIN embedding_metadata t4
    ON t2.seq_id = t4.id
WHERE t4.key = 'topic';
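If you prefer to look around from Python before writing SQL by hand, the standard sqlite3 module is enough to list the tables. This is a small sketch with one assumption worth flagging: I am assuming the database file is named chroma.sqlite3 inside the chroma_db directory, which may differ between Chroma versions. The internal tables are an implementation detail, so treat anything you find there as read-only and subject to change.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Assumed location: the persistence directory used in the example above,
# with the file name chroma.sqlite3 (adjust it if your version differs).
db_file = Path("./chroma_db") / "chroma.sqlite3"

# sqlite3.connect would silently create a missing file, so check first.
if db_file.exists():
    for table in list_tables(str(db_file)):
        print(table)
else:
    print(f"{db_file} not found; run the ChromaDB example first")
```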

How to use Pinecone

Pinecone is a fully managed vector database designed to store, index, and query embeddings at scale. It is an excellent choice for production applications that require high availability, scalability, and low latency without the need to manage the underlying infrastructure. The platform abstracts tasks such as index creation and management, scaling, and infrastructure operations, allowing developers to focus on building their applications rather than administering the database. In addition, Pinecone can be deployed across multiple cloud providers, including AWS, Google Cloud Platform (GCP), and Microsoft Azure.

Unlike ChromaDB, Pinecone is not an open-source project and is available exclusively as a managed service. Nevertheless, it offers a free tier with storage and usage limits, allowing developers to experiment with the platform before adopting it in production applications.

As in the previous section, the goal here is to demonstrate how to store embeddings and run a similarity search, this time with Pinecone. The example reuses the same sentences (the data dictionary defined in the ChromaDB section) and the same query. The code is shown below:

from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
from uuid import uuid4
from dotenv import load_dotenv

load_dotenv()


pinecone = Pinecone()

pinecone.create_index(
    name="knowledge-base",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ),
    vector_type="dense"
)

embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

vectors = []

for topic, sentences in data.items():
    embeddings = embedding_model.encode(sentences)

    for sentence, embedding in zip(sentences, embeddings):
        vectors.append({
            "id": str(uuid4()),
            "values": embedding.tolist(),
            "metadata": {
                "topic": topic,
                "sentence": sentence
            }
        })

index = pinecone.Index("knowledge-base")
index.upsert(vectors=vectors)

query = "Machine learning systems discover hidden relationships in information"
query_embedding = embedding_model.encode([query])[0].tolist()

result = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

sorted_results = sorted(
    result.matches, 
    key=lambda x: x.score,
    reverse=True
)

for match in sorted_results:
    print(match.metadata["sentence"])

First, you need to generate an API key from the Pinecone website and create a .env file containing the PINECONE_API_KEY environment variable with the generated key.
Unlike ChromaDB, which organizes data into collections, Pinecone uses indexes as its primary storage structure. The index is created using the create_index method, which takes the following settings: the index name (knowledge-base); the vector dimension (384), which must match the output dimension of the sentence-transformers/all-MiniLM-L6-v2 model; the similarity metric (cosine); the cloud provider and region where the index will be created (AWS us-east-1 in this example); and the vector type (dense), since the embeddings generated by the model are dense vectors.
To insert data into the index, each vector must have a unique identifier (id) and its embedding values (values). Optionally, metadata can also be stored, which in this example consists of the topic and the original sentence. To accomplish this, a loop generates the embeddings, UUID-based identifiers, and the corresponding metadata.
To perform operations on the index, a handle to it is obtained by calling the client’s Index method with the index name. The vectors are then uploaded using the upsert method, which inserts new records or updates existing ones that share the same identifier.
Finally, the query method is used to retrieve the three vectors that are most similar to the query vector. As in the ChromaDB example, the query sentence is first converted into an embedding using the same model employed during document indexing. The top_k=3 parameter specifies that the three most similar results should be returned, while include_metadata=True ensures that the metadata associated with each vector is also retrieved.
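The upsert semantics have a practical consequence for the ID strategy. Because the example generates IDs with uuid4, every run of the script produces brand-new identifiers, so running it twice stores each sentence twice. One possible alternative, sketched below as a suggestion rather than as something Pinecone requires, is to derive the ID from the content with uuid5 so that a second run overwrites the same records. The dictionary-based upsert function only imitates the insert-or-update behavior; the namespace string is a made-up value for this example.

```python
import uuid

# Hypothetical namespace for this example; any fixed UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "knowledge-base.example")


def stable_id(topic: str, sentence: str) -> str:
    """Derive the same UUID every time from the record's content."""
    return str(uuid.uuid5(NAMESPACE, f"{topic}:{sentence}"))


def upsert(store: dict, records: list[dict]) -> None:
    """Dictionary stand-in for upsert: insert new IDs, overwrite existing ones."""
    for record in records:
        store[record["id"]] = record


sentences = ["AI learns from data", "Robots can recognize images"]

random_store, stable_store = {}, {}
for _ in range(2):  # simulate running the ingestion script twice
    upsert(random_store, [{"id": str(uuid.uuid4()), "sentence": s} for s in sentences])
    upsert(stable_store, [{"id": stable_id("AI", s), "sentence": s} for s in sentences])

print(len(random_store))  # 4: each run created new IDs, so records are duplicated
print(len(stable_store))  # 2: the second run overwrote the same two IDs
```

The same reasoning applies to the ChromaDB and Qdrant examples in this post, since both also identify records by the ID you provide.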

The query results are shown below, displaying the three sentences that are most similar to the input query:

"AI learns from data"
"Algorithms improve decision making"
"Robots can recognize images"

How to use Qdrant

Qdrant is an open-source vector database designed for high-performance similarity search. It uses the HNSW algorithm for approximate nearest neighbor search and extends it with Filterable HNSW, enabling efficient vector search combined with metadata filtering. In addition, Qdrant provides multiple deployment options: it can run locally or in self-hosted environments (for example, using Docker or Kubernetes), or as a managed service through Qdrant Cloud, Hybrid Cloud, and Private Cloud.

Overall, Qdrant is an excellent choice for applications that require high-performance vector search, complex metadata filtering, and greater control over the infrastructure and search mechanisms. As an open-source solution designed for production environments, it stands out in scenarios that demand scalability, operational flexibility, and efficient querying over large volumes of data.

As in the previous sections, the code below demonstrates how to use Qdrant to store embeddings and retrieve the sentences that are most similar to the input query.

import os
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer
from uuid import uuid4
from dotenv import load_dotenv

load_dotenv()


client = QdrantClient(
    url=os.getenv("QDRANT_ENDPOINT"),
    api_key=os.getenv("QDRANT_API_KEY")
)

client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

points = []

for topic, sentences in data.items():
    for sentence in sentences:
        embedding = embedding_model.encode(sentence)

        point = PointStruct(
            id=str(uuid4()),
            vector=embedding.tolist(),
            payload={
                "topic": topic,
                "sentence": sentence
            }
        )

        points.append(point)

client.upsert(
    collection_name="knowledge_base",
    points=points,
)

query = "Machine learning systems discover hidden relationships in information"
query_embedding = embedding_model.encode([query])[0].tolist()

results = client.query_points(
    collection_name="knowledge_base",
    query=query_embedding,
    with_payload=True,
    limit=3
)

for point in results.points:
    print(point.payload["sentence"])

In this example, Qdrant Cloud is used. Therefore, the first step is to create a cluster on the platform. Once the cluster is created, Qdrant provides its endpoint and API key. These values are stored in a .env file using the QDRANT_ENDPOINT and QDRANT_API_KEY environment variables and are later used to instantiate the client through the QdrantClient class.
Next, a collection is created using the create_collection method. The vector configuration is specified through the vectors_config parameter using the VectorParams class. In this example, the vector dimension is set to 384, matching the output dimension of the sentence-transformers/all-MiniLM-L6-v2 model, while the selected similarity metric is cosine similarity (Distance.COSINE).
In Qdrant, each record stored in a collection is called a Point. Every point contains a unique identifier (id), a vector (vector), and, optionally, a payload containing metadata. In this example, the payload stores the topic and the original sentence. After the points are created, they are inserted into the collection using the upsert method.
Finally, a query is executed using the query_points method. The query sentence is first converted into an embedding using the same model employed during document indexing. The resulting embedding is then used as the query vector (query_embedding). The limit=3 parameter specifies that the three most similar points should be returned, while with_payload=True ensures that the payload associated with each result is also retrieved.
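Two things in this setup tend to fail with confusing errors: a missing environment variable and a vector whose size does not match the collection. The helpers below are an optional pre-flight check I would consider adding; require_env and check_dimensions are hypothetical names, not part of qdrant_client. They read from os.environ, so in the script above they would be called after load_dotenv().

```python
import os

EXPECTED_DIMENSION = 384  # output size of all-MiniLM-L6-v2, as used in this post


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def check_dimensions(vectors, expected=EXPECTED_DIMENSION):
    """Raise if any vector does not have the dimension the collection expects."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, expected {expected}"
            )


check_dimensions([[0.0] * 384, [0.1] * 384])  # passes silently

try:
    check_dimensions([[0.0] * 384, [0.0] * 768])
except ValueError as error:
    print(error)  # vector 1 has 768 dimensions, expected 384

# Usage in the script above (left commented so the sketch runs without a .env file):
# url = require_env("QDRANT_ENDPOINT")
# api_key = require_env("QDRANT_API_KEY")
```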

The query results are shown below, displaying the three most similar sentences retrieved by Qdrant:

"AI learns from data"
"Algorithms improve decision making"
"Robots can recognize images"

Other widely used vector databases

In addition to the three vector databases presented in the previous sections, several other solutions are also widely used:

Milvus: an open-source vector database with a distributed architecture, designed for large-scale similarity search. Its architecture decouples the storage and compute layers while supporting multiple ANN algorithms. It is an excellent choice for applications that store millions or even billions of vectors and require high scalability.
Weaviate: an open-source vector database that goes beyond vector storage by positioning itself as an AI-native data platform. It combines vectors, structured data, AI modules, and advanced search capabilities in a single solution. It stands out for its hybrid search, automatic vectorization, and seamless integration with RAG applications, making it an excellent choice for teams building AI applications with both structured and unstructured data.
pgvector: a PostgreSQL extension that adds vector data types and similarity search capabilities to a relational database. It supports ANN indexes, integrates natively with SQL, and is fully compatible with the PostgreSQL ecosystem. It is an excellent choice for teams that already rely on PostgreSQL and want to add similarity search to applications with moderate data volumes without introducing a dedicated vector database.

Finally, it is worth noting that all the vector databases presented in this post integrate with LangChain, making it easier to build embedding-based applications and allowing the underlying vector database to be replaced with minimal changes to the application code.
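The reason a swap can be cheap is that the application talks to a common interface instead of a specific database client. The sketch below illustrates the idea with a hypothetical VectorStore protocol and two toy in-memory backends. To be clear, this is not LangChain’s API; it is only a minimal picture of the pattern, with dot-product scoring chosen for brevity.

```python
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical minimal interface; not LangChain's actual API."""

    def add(self, record_id: str, vector: list[float], text: str) -> None: ...

    def search(self, vector: list[float], top_k: int) -> list[str]: ...


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ListStore:
    """Toy backend that keeps records in a list."""

    def __init__(self):
        self._rows = []

    def add(self, record_id, vector, text):
        self._rows.append((record_id, vector, text))

    def search(self, vector, top_k):
        ranked = sorted(self._rows, key=lambda row: dot(row[1], vector), reverse=True)
        return [text for _, _, text in ranked[:top_k]]


class DictStore:
    """Toy backend that keeps records in a dictionary keyed by ID."""

    def __init__(self):
        self._rows = {}

    def add(self, record_id, vector, text):
        self._rows[record_id] = (vector, text)

    def search(self, vector, top_k):
        ranked = sorted(self._rows.values(), key=lambda row: dot(row[0], vector), reverse=True)
        return [text for _, text in ranked[:top_k]]


def build_and_query(store: VectorStore) -> list[str]:
    """Application code: depends only on the interface, not on the backend."""
    store.add("1", [1.0, 0.0], "about AI")
    store.add("2", [0.0, 1.0], "about food")
    return store.search([0.9, 0.1], top_k=1)


print(build_and_query(ListStore()))  # ['about AI']
print(build_and_query(DictStore()))  # ['about AI']
```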