🔎 ANN (IVF_FLAT)

Approximate Nearest Neighbor (ANN) indexing accelerates similarity search by trading a small amount of recall for large performance gains. In enVector, you can enable ANN by creating an index with index_params.index_type = "ivf_flat". Internally, IVF (Inverted File) partitions the vector space into nlist clusters (coarse centroids) and scans only nprobe of them at query time.
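
To make the mechanics concrete, here is a minimal NumPy sketch of the IVF idea (an illustration only, not enVector's implementation): at build time each vector is assigned to its nearest centroid, and at query time only the nprobe closest lists are scanned.

import numpy as np

# Toy IVF sketch (illustration only, not enVector's implementation).
# Assumes L2-normalized vectors and inner-product (cosine) scoring.
def build_ivf(vectors, centroids):
    # Assign each vector to its nearest centroid (highest inner product).
    assignments = np.argmax(vectors @ centroids.T, axis=1)
    return {c: np.where(assignments == c)[0] for c in range(len(centroids))}

def ivf_search(query, vectors, centroids, lists, nprobe, top_k):
    # Rank centroids by similarity to the query and scan only nprobe lists.
    probe = np.argsort(query @ centroids.T)[::-1][:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = vectors[candidates] @ query
    order = np.argsort(scores)[::-1][:top_k]
    return candidates[order], scores[order]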


When to use

  • Large datasets where exact scan is too slow.

  • Latency-sensitive applications needing fast top-k results.

  • When it is acceptable to trade a little recall for speed by tuning nprobe.


Key parameters

  • index_type (str): Set to "ivf_flat" to enable ANN.

  • nlist (int): Number of coarse clusters (lists). Larger values → finer partitioning but larger index and build cost.

  • default_nprobe (int): Default number of clusters to scan during search. Larger values → higher recall, higher latency.

  • centroids (optional): Precomputed nlist centroids. Accepted types:

    • 2D NumPy ndarray with shape (nlist, dim)

    • list[np.ndarray]

    • list[list[float]]

    If omitted, the client generates random centroids and sends them to the server.

Notes:

  • Vector and centroid dimensions must match the index dim (e.g., 32–4096). L2-normalize vectors if you use Inner Product scoring; with L2-normalized inputs, Inner Product is equivalent to cosine similarity.

  • If you provide centroids, ensure len(centroids) == nlist and that nlist ≤ the number of vectors used to fit them (a quick shape check is sketched below).
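
As referenced in the notes, a quick sanity check before create_index (a hypothetical helper, shown only for illustration):

import numpy as np

def check_centroids(centroids, nlist, dim):
    # Hypothetical helper: validate user-provided centroids before create_index.
    arr = np.asarray(centroids, dtype=np.float32)
    assert arr.shape == (nlist, dim), f"expected shape ({nlist}, {dim}), got {arr.shape}"
    return arr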


Providing centroids fitted on your data (e.g., with KMeans) typically yields a better recall/latency trade-off than random centroids. The example below fits centroids with scikit-learn's KMeans and passes them at index creation.

import numpy as np
from sklearn.cluster import KMeans
import pyenvector as ev

# 1) Init
ev.init(address="localhost:50050", key_path="./keys", key_id="test-key")

# 2) Prepare data (L2-normalized)
def get_random_vector(dim, seed=None):
    if seed is not None:
        np.random.seed(seed)
    vec = np.random.uniform(-1.0, 1.0, dim)
    norm = np.linalg.norm(vec)
    if norm > 0:
        vec = vec / norm
    return vec

DIM = 512
num_data = 100
nlist = 8
vectors = [get_random_vector(DIM, seed=42 + i) for i in range(num_data)]

# 3) Fit centroids
kmeans = KMeans(n_clusters=nlist, n_init=10, random_state=42)  # n_init pinned; its default changed across scikit-learn versions
kmeans.fit(np.stack(vectors))
# Either pass as ndarray (preferred) or convert to list
centroids = kmeans.cluster_centers_          # np.ndarray shape: (nlist, dim)
# centroids = kmeans.cluster_centers_.tolist()  # alternatively, list[list[float]]

# 4) Create IVF_FLAT index
index_params = {
    "index_type": "ivf_flat",
    "nlist": nlist,
    "default_nprobe": 4,
    "centroids": centroids,  # optional but recommended
}
index_name = "test_index"
index = ev.create_index(index_name, DIM, index_params=index_params)

# 5) Insert data and search
index.insert(vectors, metadata=[f"Item {i+1}" for i in range(num_data)])
search_index = ev.Index(index_name)
search_params = {"nprobe": 2}
result = search_index.search(
    [vectors[0]],
    top_k=2,
    output_fields=["metadata"],
    search_params=search_params,  # per-search override of default_nprobe
)[0]
print(result)

Client-generated random centroids (quick start)

For quick experiments, you may skip KMeans and let the client initialize random centroids and pass them to the server. This reduces setup time but may underperform compared to data-fitted centroids.

index_params = {
    "index_type": "ivf_flat",
    "nlist": 8,
    "default_nprobe": 4,
    # no "centroids" → client will initialize randomly and send
}
index = ev.create_index("test_index", 512, index_params=index_params)

Tuning tips

  • Choose nlist to reflect dataset size. A common starting heuristic is nlist ≈ √N; validate against your recall/latency targets (see the sketch after this list).

  • Increase nprobe to improve recall; decrease it to reduce latency. You can override default_nprobe per search via search_params, as in the example above.

  • Keep vectors L2-normalized if you use Inner Product scoring (cosine-equivalent with L2-normalized inputs).

  • Ensure nlist ≤ number_of_vectors when fitting centroids with KMeans.
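
As a starting point for the √N heuristic above (example values only; validate against your own recall/latency targets):

import numpy as np

num_vectors = 100_000                      # example dataset size
nlist = max(1, int(np.sqrt(num_vectors)))  # ≈ 316 lists for 100k vectors
nprobe = max(1, nlist // 16)               # rough starting point; raise to improve recall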


Troubleshooting

  • Poor recall or unstable latency: Fit centroids on representative data and increase nprobe.

  • Import errors for KMeans: Install scikit-learn in your environment (pip install scikit-learn).
