Finding K closest vectors using heaps in Python

4 min readAug 11, 2024

Picture of a binary max-heap data structure taken from Wikipedia

When it comes to efficiently solving problems that involve finding the top ‘K’ elements from a dataset, heaps provide an elegant solution. In Python, the heapq module simplifies heap operations, making it a valuable tool for various applications including searching for the k closest matches to a vector using the Euclidean distance using only standard Python libraries. Here, we'll walk through a practical example of how to use Python's heapq module to solve this problem.

What is a Heap?

A heap is a specialized tree-based data structure. The heaps in Python are binary heaps, which means they’re binary trees with two specific properties:

Heap Property: In a min-heap, for any given node i, the value of i is smaller than or equal to the values of its children. Conversely, in a max-heap, for any given node i, the value of i is larger than or equal to the values of its children.
Shape Property: All levels of the tree are fully filled except possibly the last level, which is filled from left to right.

In Python, the heapq module provides an implementation of the min-heap. We'll leverage this along with some tricks to simulate a max-heap for our use case.

Use Case: Finding K Closest Matches

Imagine we have a reference vector and a set of candidate vectors. We want to find the K vectors closest to our reference vector based on the Euclidean distance. The steps we’ll follow are as follows:

Calculate the Euclidean distance for each candidate vector from the reference vector.
Use a max-heap to keep track of the top K closest vectors.

Why Max-Heap?

While heapq provides a min-heap, we utilize a max-heap to keep the farthest of the K nearest neighbors at the root. This allows us to easily replace the farthest element when a closer candidate is found. To accomplish this in a min-heap environment, we store the negative of the distance.

Implementation

Let’s jump into the implementation.

Step 1: Import Required standard Libraries

import heapq
from math import sqrt

Step 2: Calculate Euclidean Distance

def euclidean_distance(vec1, vec2):
    return sqrt(sum((x - y)**2 for x, y in zip(vec1, vec2)))

Step 3: Define the Function to Find K Closest Matches

def find_k_closest(reference, candidates, k):
    # Initialize a max-heap
    max_heap = []
    for idx, vector in enumerate(candidates):
      dist = euclidean_distance(reference, vector)
      # Using negative distance to simulate max-heap
      heapq.heappush(max_heap, (-dist, idx))
      # Ensure the heap size does not exceed k
      if len(max_heap) > k:
          heapq.heappop(max_heap)
    # Extract the indices of the k closest vectors
    closest_indices = [idx for (_, idx) in max_heap]
    return [candidates[i] for i in closest_indices]

Usage Example

Let’s assume we have a reference vector and a list of candidate vectors:

reference_vector = [1, 2, 3]
candidate_vectors = [
    [2, 3, 4],
    [4, 5, 6],
    [1, 0, 1],
    [5, 3, 4],
    [0, 2, 1]
]

# Define the number of closest matches to find
k = 3

Now, we can find the k closest vectors using our function:

closest_matches = find_k_closest(reference_vector, candidate_vectors, k)
print("The k closest matches are:", closest_matches)

Incorporating Tuples and Record IDs

If our candidate vectors come with an associated record ID, we can modify our function to store and return these records as follows:

def find_k_closest_with_ids(reference, candidates_with_ids, k):
    # Initialize a max-heap
    max_heap = []

    for record_id, vector in candidates_with_ids:
      dist = euclidean_distance(reference, vector)
      # Using a tuple with negative distance and record ID to simulate max-heap
      heapq.heappush(max_heap, (-dist, record_id))
      # Ensure the heap size does not exceed k
      if len(max_heap) > k:
          heapq.heappop(max_heap)

    # Extract the record IDs of the k closest vectors
    closest_ids = [record_id for (_, record_id) in max_heap]
    return closest_ids

NOTE: The first value in the tuple that is added to the heap is used as the key for all the heap related operations and all the remaining values in the tuple are ignored for the same.

Heapify and Heapreplace

If we start with an existing list and want to maintain it as a heap, we can use heapq.heapify. Additionally, if we need to replace the root of the heap, heapq.heapreplace offers an efficient way to do so. As per the docs, heapreplace, pops and returns the smallest item from the heap, and also pushes the new item. The heap size doesn’t change. If the heap is empty, IndexError is raised.

This one step operation is more efficient than a heappop() followed by heappush() and can be more appropriate when using a fixed-size heap. The pop/push combination always returns an element from the heap and replaces it with item.

Usage of Heapify

existing_heap = [(-euclidean_distance(reference_vector, v), idx) for idx, v in enumerate(candidate_vectors[:k])]
heapq.heapify(existing_heap)

# Now, we can easily manage the heap of k elements
for idx in range(k, len(candidate_vectors)):
    dist = euclidean_distance(reference_vector, candidate_vectors[idx])
    if -existing_heap[0][0] > dist:
        heapq.heapreplace(existing_heap, (-dist, idx))

Conclusion

Using heaps, specifically max-heaps simulated with negative values, is a powerful technique to solve the problem of finding the K closest matches efficiently. By leveraging Python’s heapq module, we can keep our implementation concise and straightforward. This approach extends beyond finding closest points and can be applied to various top-k problems where efficiency is key.

Heaps are an understated yet immensely powerful data structure, and understanding their utility can significantly enhance the efficiency of your algorithms.