Here's a structured script for explaining how to solve the K Closest Points to Origin problem with multiple approaches (brute-force sorting, a heap, and quickselect), along with the pros and cons of each:


You:
"To solve the problem of finding the k closest points to the origin in a list of points, we can explore multiple approaches depending on the input size and desired efficiency. The goal is to calculate the Euclidean distance for each point from the origin (0, 0) and return the k points with the smallest distances.

  1. Brute-Force Approach (Sorting)
    The simplest approach is to calculate the distance of each point from the origin, store these distances in a list, and then sort the list based on these distances. We can then select the first k points from this sorted list.
    • Calculate the Euclidean distance squared for each point to avoid using sqrt (not necessary for comparing distances).
    • Sort the list of points based on the calculated distances.
    • Return the first k points from the sorted list.
    Pros:
    • Straightforward and easy to implement.
    Cons:
    • Sorting the list takes O(n log n) time, which can be inefficient for large n if only k smallest distances are required.
    • This approach doesn’t leverage the fact that we only need the closest k points, not the full sorted list.
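    The no-sqrt shortcut above can be sanity-checked directly: sqrt is monotonically increasing on non-negative numbers, so sorting by squared distance yields exactly the same order as sorting by true distance. A minimal demonstration:

```python
import math

points = [[1, 3], [-2, 2], [5, 8], [0, 1]]

by_true_distance = sorted(points, key=lambda p: math.sqrt(p[0]**2 + p[1]**2))
by_squared = sorted(points, key=lambda p: p[0]**2 + p[1]**2)

print(by_true_distance == by_squared)  # True: the orderings are identical
```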
  2. Heap Approach (Optimized for Efficiency)
    To improve efficiency, we can use a max-heap to store the k closest points while traversing the points list:
    • We first add the first k points to the max-heap based on their distances to the origin.
    • For each additional point, calculate its distance and compare it with the root of the max-heap (the largest distance among the k closest points so far). If the new point’s distance is smaller, we remove the root and insert the new point into the heap.
    • This ensures that we only keep k points in the heap at any time.
    Pros:
    • O(n log k) time complexity, which is more efficient than sorting when k is much smaller than n.
    • The heap helps us efficiently maintain the k closest points without sorting the entire list.
    Cons:
    • Implementing and managing a max-heap might add complexity, but Python’s heapq library can simplify this.
  3. Quickselect Algorithm (Alternative Optimal Solution)
    Another approach is to use the Quickselect algorithm, which is similar to quicksort but only partially sorts the array to find the k smallest elements:
    • Select a pivot point and partition the array around this pivot based on distance.
    • After partitioning, if the pivot position is k, then the first k points in the list are the closest.
    • Otherwise, recursively apply quickselect on the relevant half of the array until the first k points are found.
    Pros:
    • O(n) average time complexity, which is faster than heap-based solutions for large inputs.
    • No additional memory overhead since quickselect works in-place.
    Cons:
    • The worst-case time complexity is O(n^2), though choosing a random pivot or median-of-medians strategy can mitigate this.
    • It’s more complex to implement than the heap approach.
  4. Alternative Approach (Using Sorted Containers for Smaller Inputs)
    For scenarios where the input list of points is relatively small, using Python’s built-in sorted function or even a data structure like a SortedList (from sortedcontainers library) can simplify implementation.
    • Sort the points based on the distance in O(n log n) time.
    • Return the first k points.
    This is typically less efficient than heap-based solutions for large n but can be simpler for small datasets.
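    As a sketch of this simpler style using only the standard library (sortedcontainers is a third-party dependency), `heapq.nsmallest` does the sort-and-slice in one call; the function name `k_closest_simple` is illustrative:

```python
import heapq
from typing import List

def k_closest_simple(points: List[List[int]], k: int) -> List[List[int]]:
    # nsmallest returns the k items with the smallest keys, in ascending order
    return heapq.nsmallest(k, points, key=lambda p: p[0]**2 + p[1]**2)

points = [[1, 3], [-2, 2], [5, 8], [0, 1]]
print(k_closest_simple(points, 2))  # [[0, 1], [-2, 2]]
```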

To summarize:

  • For a basic solution, sorting works fine but is not efficient for large n.
  • For larger inputs, the max-heap approach efficiently maintains the closest k points in O(n log k) time.
  • For very large inputs, Quickselect provides an efficient in-place solution with average O(n) time complexity.

Given these options, I would choose the max-heap approach for balancing efficiency and ease of implementation, or Quickselect for the absolute best average time complexity."


Here are Python implementations for each solution to the K Closest Points to Origin problem: sorting, heap, and quickselect.

1. Sorting Solution

This approach calculates the distances and sorts the points by distance.

from typing import List

def k_closest_sort(points: List[List[int]], k: int) -> List[List[int]]:
    # Sort by squared distance (sqrt is unnecessary: squaring preserves the ordering)
    points.sort(key=lambda point: point[0]**2 + point[1]**2)
    # Return the first k points
    return points[:k]

# Example usage
points = [[1, 3], [-2, 2], [5, 8], [0, 1]]
k = 2
print(k_closest_sort(points, k))  # Output: [[0, 1], [-2, 2]]

2. Heap Solution

This approach uses a max-heap to keep track of the k closest points.

import heapq
from typing import List

def k_closest_heap(points: List[List[int]], k: int) -> List[List[int]]:
    # Initialize a max-heap with the first k points (negative distance for max-heap behavior)
    max_heap = [(-point[0]**2 - point[1]**2, point) for point in points[:k]]
    heapq.heapify(max_heap)

    # Iterate over the remaining points
    for point in points[k:]:
        distance = -point[0]**2 - point[1]**2  # Negative for max-heap
        # If current point is closer than the farthest in the heap
        if distance > max_heap[0][0]:
            heapq.heappop(max_heap)
            heapq.heappush(max_heap, (distance, point))

    # Extract the points from the heap
    return [point for (_, point) in max_heap]

# Example usage
points = [[1, 3], [-2, 2], [5, 8], [0, 1]]
k = 2
print(k_closest_heap(points, k))  # Output: [[-2, 2], [0, 1]] (heap order, not sorted by distance)

3. Quickselect Solution

This approach uses Quickselect to partition the points so that the k closest points are in the first k positions of the list.

import random
from typing import List

def k_closest_quickselect(points: List[List[int]], k: int) -> List[List[int]]:
    def distance(point):
        return point[0]**2 + point[1]**2

    def partition(left, right, pivot_index):
        pivot_distance = distance(points[pivot_index])
        # Move pivot to the end
        points[pivot_index], points[right] = points[right], points[pivot_index]
        store_index = left
        # Move all points closer than pivot to the left
        for i in range(left, right):
            if distance(points[i]) < pivot_distance:
                points[store_index], points[i] = points[i], points[store_index]
                store_index += 1
        # Move pivot to its final place
        points[right], points[store_index] = points[store_index], points[right]
        return store_index

    def quickselect(left, right, k_smallest):
        if left >= right:  # Zero or one element in range: nothing to partition
            return
        # Select a random pivot index
        pivot_index = random.randint(left, right)
        # Partition the array around the pivot
        pivot_index = partition(left, right, pivot_index)
        # Recursively apply the quickselect
        if k_smallest == pivot_index:
            return
        elif k_smallest < pivot_index:
            quickselect(left, pivot_index - 1, k_smallest)
        else:
            quickselect(pivot_index + 1, right, k_smallest)

    # Perform quickselect for the k closest points
    quickselect(0, len(points) - 1, k)
    # Return the first k elements
    return points[:k]

# Example usage
points = [[1, 3], [-2, 2], [5, 8], [0, 1]]
k = 2
print(k_closest_quickselect(points, k))  # The k closest points ([0, 1] and [-2, 2]); order depends on pivots

Summary:

  • Sorting Solution: Simple but not optimal for large lists (O(n log n)).
  • Heap Solution: Efficient for large lists when k is small (O(n log k)).
  • Quickselect Solution: Most efficient average time complexity for large lists (O(n) on average).

Here’s an explanation of each solution to the K Closest Points to Origin problem:


1. Sorting Solution

Explanation:

  • In this approach, we start by calculating the Euclidean distance squared for each point from the origin. We square the distances (point[0]**2 + point[1]**2) instead of taking the square root to avoid unnecessary computations and because the relative order of distances remains unchanged without square roots.
  • We sort the list of points based on these squared distances in ascending order.
  • Finally, we return the first k points from the sorted list as they represent the closest points.

Time Complexity: O(n log n)

  • Sorting the list of n points is the main operation and takes O(n log n) time.

Space Complexity: O(1) auxiliary for in-place sorting (note that Python’s Timsort can use up to O(n) working space)

Pros:

  • Simple to implement and straightforward to understand.
  • Works well for relatively small input sizes.

Cons:

  • Sorting the entire list of points is inefficient for large inputs when we only need k closest points.
  • Doesn’t take advantage of the fact that we only need a partial result (k points) and could be optimized further for larger inputs.

2. Heap Solution (Optimal for Larger Inputs)

Explanation:

  • This approach uses a max-heap to store the k closest points. A max-heap allows us to efficiently maintain the k smallest distances at any given time.
  • We start by adding the first k points from points into the max-heap. We store each point in the heap with its negative squared distance, as Python’s heapq module only supports a min-heap, and we need a max-heap behavior to keep the farthest element at the root.
  • For each remaining point, we calculate its squared distance. If this new distance is smaller than the largest distance in the max-heap (i.e., the current farthest point), we replace the root with the new point, keeping only the k closest points in the heap.
  • After processing all points, the heap contains exactly the k closest points.
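    The negation trick in isolation: since Python's heapq only provides a min-heap, negating the keys makes the largest value surface at the root, giving max-heap behavior. A minimal sketch:

```python
import heapq

nums = [5, 1, 9, 3]
max_heap = [-x for x in nums]  # negate so the largest value has the smallest key
heapq.heapify(max_heap)

largest = -heapq.heappop(max_heap)  # undo the negation on the way out
print(largest)  # 9
```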

Time Complexity: O(n log k)

  • Constructing and maintaining a max-heap with only k elements takes O(log k) time per insertion.
  • For n points, this results in O(n log k) time, which is much more efficient than sorting for large n when k is small.

Space Complexity: O(k)

  • The heap stores only k points at any time, which is efficient in terms of space.

Pros:

  • Efficient for large inputs where k is much smaller than n, as we don’t need to sort all points.
  • Uses only O(k) extra space for the heap, making it memory-efficient.

Cons:

  • Slightly more complex to implement, particularly due to the max-heap setup using Python’s min-heap functionality.

3. Quickselect Solution (Optimal Average-Case Time Complexity)

Explanation:

  • This approach uses the Quickselect algorithm, which is similar to quicksort but focuses only on partitioning the list until we have the k smallest elements at the start of the list.
  • We first select a pivot point randomly and partition the points based on their distance from the origin. All points with distances less than the pivot’s distance are moved to the left, and all with greater distances to the right.
  • After partitioning, if the pivot position is k, the first k elements are the closest points, and we return them.
  • If the pivot position is less than k, we apply quickselect recursively to the right side, as we need more points to reach k. If the pivot position is greater than k, we apply quickselect on the left side.
  • This process continues until we find the first k closest points.

Time Complexity: O(n) on average

  • Each partition step discards part of the range, and with good pivots the work forms a geometric series (n + n/2 + n/4 + ...), giving an average-case time complexity of O(n).
  • However, in the worst case (e.g., when the pivot is always poorly chosen), this algorithm can degrade to O(n^2).

Space Complexity: O(1) auxiliary for the in-place partitioning, plus the recursion stack (O(log n) expected, O(n) in the worst case).

Pros:

  • The most efficient average-case solution for large inputs, particularly when k is close to n.
  • In-place operations mean no additional space is needed beyond the input list.

Cons:

  • The worst-case time complexity is O(n^2), though using a random pivot helps avoid this in practice.
  • More complex to implement than heap or sorting solutions.
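    Since the recursion always descends into exactly one side, quickselect is tail-recursive and can be rewritten as a loop, removing the recursion-depth concern noted above. A sketch using the same partitioning scheme as the earlier implementation (the name `k_closest_iterative` is illustrative):

```python
import random
from typing import List

def k_closest_iterative(points: List[List[int]], k: int) -> List[List[int]]:
    def distance(point):
        return point[0]**2 + point[1]**2

    left, right = 0, len(points) - 1
    while left < right:
        # Same partition scheme as the recursive version
        pivot_index = random.randint(left, right)
        pivot_distance = distance(points[pivot_index])
        points[pivot_index], points[right] = points[right], points[pivot_index]
        store_index = left
        for i in range(left, right):
            if distance(points[i]) < pivot_distance:
                points[store_index], points[i] = points[i], points[store_index]
                store_index += 1
        points[right], points[store_index] = points[store_index], points[right]

        # Narrow the search window instead of recursing
        if store_index == k:
            break
        elif k < store_index:
            right = store_index - 1
        else:
            left = store_index + 1
    return points[:k]

# Example usage: returns the two closest points, order depends on pivots
print(k_closest_iterative([[1, 3], [-2, 2], [5, 8], [0, 1]], 2))
```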

Summary of Approach Suitability

  • Sorting: Best for smaller lists or if you need a simple solution without concern for optimal efficiency.
  • Heap: Ideal for large lists where k is small compared to n, balancing ease of implementation with optimal performance.
  • Quickselect: The most efficient solution for very large lists when you want the lowest average time complexity, but it requires careful handling to avoid worst-case scenarios.

Each approach provides a useful solution depending on the input size and constraints, showing flexibility in problem-solving.
