🔎 ANN (IVF_FLAT)
Approximate Nearest Neighbor (ANN) indexing accelerates similarity search by trading a small amount of recall for large performance gains. In enVector, you can enable ANN by creating an index with `index_params.index_type = "ivf_flat"`. Internally, IVF (Inverted File) partitions the vector space into `nlist` clusters (coarse centroids) and scans only `nprobe` of them at query time.
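To make the mechanics concrete, here is a minimal, self-contained NumPy sketch of the IVF idea; it is an illustration only, not enVector's implementation. Build time assigns every vector to its nearest coarse centroid (one "inverted list" per centroid), and query time scans only the `nprobe` closest lists exactly:

```python
import numpy as np

# Illustrative IVF sketch (not enVector's implementation).
rng = np.random.default_rng(0)
dim, n, nlist, nprobe, top_k = 32, 1000, 8, 2, 5

data = rng.normal(size=(n, dim)).astype(np.float32)
centroids = data[rng.choice(n, nlist, replace=False)]  # toy coarse centroids

# Build: assign each vector to its nearest centroid -> one inverted list per centroid
assignments = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
lists = {c: np.where(assignments == c)[0] for c in range(nlist)}

# Search: rank centroids by distance to the query, then scan only
# the members of the nprobe closest lists exactly
query = rng.normal(size=dim).astype(np.float32)
probe = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
candidates = np.concatenate([lists[c] for c in probe])
dists = ((data[candidates] - query) ** 2).sum(-1)
print(candidates[np.argsort(dists)[:top_k]])  # approximate top-k ids
```

Setting `nprobe = nlist` degenerates to an exact scan, which is why raising `nprobe` recovers recall at the cost of latency.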
When to use
- Large datasets where exact scan is too slow.
- Latency-sensitive applications needing fast top-k results.
- Acceptable to tune recall via `nprobe` to balance speed/accuracy.
Key parameters
- `index_type` (str): Set to `"ivf_flat"` to enable ANN.
- `nlist` (int): Number of coarse clusters (lists). Larger values → finer partitioning but larger index and build cost.
- `default_nprobe` (int): Default number of clusters to scan during search. Larger values → higher recall, higher latency.
- `centroids` (optional): Precomputed `nlist` centroids. Accepted types: 2D NumPy `ndarray` with shape `(nlist, dim)`, `list[np.ndarray]`, or `list[list[float]]`. If omitted, the client generates random centroids and sends them to the server.

Notes:

- Dimensions must match the index `dim` (e.g., 32–4096). L2 normalization of vectors is recommended for stable Inner Product scoring (cosine equivalence).
- If you provide `centroids`, ensure `len(centroids) == nlist` and that `nlist ≤ number_of_vectors` used to fit the centroids; a quick shape check is sketched below.
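The accepted centroid formats all carry the same `(nlist, dim)` data, and the shape constraints above are cheap to verify before calling `create_index`. A minimal sketch, with illustrative variable names:

```python
import numpy as np

nlist, dim = 8, 512
fitted = np.random.rand(nlist, dim).astype(np.float32)  # stand-in for fitted centroids

# The three accepted formats carry the same (nlist, dim) data:
as_ndarray = fitted                 # 2D np.ndarray, shape (nlist, dim)
as_list_of_arrays = list(fitted)    # list[np.ndarray], each of shape (dim,)
as_nested_lists = fitted.tolist()   # list[list[float]]

# Cheap sanity checks before create_index, catching the common mismatches
assert len(as_ndarray) == nlist
assert all(len(row) == dim for row in as_nested_lists)
```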
Client-provided centroids (recommended)
Providing centroids fitted on your data (e.g., KMeans) typically yields better recall/latency trade-offs than random centroids.
```python
import numpy as np
from sklearn.cluster import KMeans
import pyenvector as ev

# 1) Init
ev.init(address="localhost:50050", key_path="./keys", key_id="test-key")

# 2) Prepare data (L2-normalized)
def get_random_vector(dim, seed=None):
    if seed is not None:
        np.random.seed(seed)
    vec = np.random.uniform(-1.0, 1.0, dim)
    norm = np.linalg.norm(vec)
    if norm > 0:
        vec = vec / norm
    return vec

DIM = 512
num_data = 100
nlist = 8
vectors = [get_random_vector(DIM, seed=42 + i) for i in range(num_data)]

# 3) Fit centroids
kmeans = KMeans(n_clusters=nlist, random_state=42)
kmeans.fit(np.stack(vectors))
# Either pass as ndarray (preferred) or convert to list
centroids = kmeans.cluster_centers_  # np.ndarray, shape (nlist, dim)
# centroids = kmeans.cluster_centers_.tolist()  # alternatively, list[list[float]]

# 4) Create IVF_FLAT index
index_params = {
    "index_type": "ivf_flat",
    "nlist": nlist,
    "default_nprobe": 4,
    "centroids": centroids,  # optional but recommended
}
index_name = "test_index"
index = ev.create_index(index_name, DIM, index_params=index_params)

# 5) Insert data and search
index.insert(vectors, metadata=[f"Item {i+1}" for i in range(num_data)])
search_index = ev.Index(index_name)
search_params = {"nprobe": 2}
result = search_index.search(
    [vectors[0]],
    top_k=2,
    output_fields=["metadata"],
    search_params=search_params,
)[0]
print(result)
```

Client-generated random centroids (quick start)
For quick experiments, you may skip KMeans and let the client initialize random centroids and pass them to the server. This reduces setup time but may underperform compared to data-fitted centroids.
```python
index_params = {
    "index_type": "ivf_flat",
    "nlist": 8,
    "default_nprobe": 4,
    # no "centroids" → client will initialize randomly and send
}
index = ev.create_index("test_index", 512, index_params=index_params)
```

Tuning tips
- Choose `nlist` to reflect dataset size (common heuristic: √N, then validate; see the sketch after this list).
- Increase `nprobe` to improve recall; decrease to improve latency. You can override per-search if supported by the client.
- Keep vectors L2-normalized if you use Inner Product scoring (cosine-equivalent with L2-normalized inputs).
- Ensure `nlist ≤ number_of_vectors` when fitting centroids with KMeans.
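A sketch of the √N starting point and a recall helper for validating an `nprobe` sweep; the commented loop reuses `search_index` from the example above, and `ids_from` / `exact_top_10` are hypothetical placeholders for extracting result ids and a brute-force ground truth:

```python
import math

# sqrt(N) starting point for nlist; validate on your own data
num_vectors = 100_000
nlist = max(1, round(math.sqrt(num_vectors)))  # ≈ 316

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the ANN search recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Sweep nprobe per search and watch the recall/latency trade-off, e.g.:
# for nprobe in (1, 2, 4, 8):
#     result = search_index.search([query], top_k=10,
#                                  search_params={"nprobe": nprobe})[0]
#     print(nprobe, recall_at_k(ids_from(result), exact_top_10))
```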
Troubleshooting
- Poor recall or unstable latency: fit centroids on representative data and increase `nprobe`.
- Import errors for KMeans: install scikit-learn in your environment (`pip install scikit-learn`).