Heap/Quickselect | Top K Frequent Elements

2024. 10. 26. 05:40

https://leetcode.com/problems/top-k-frequent-elements/description/?envType=company&envId=facebook&favoriteSlug=facebook-thirty-days

Here’s an updated structured script that includes the Quickselect approach for solving the Top K Frequent Elements problem, along with sorting, heap, and bucket sort methods.

You:
"To solve the problem of finding the k most frequent elements in a list of integers, there are a few different approaches we can take, depending on the input size and the desired level of efficiency. I’ll walk through each approach, from simplest to most optimized, including their pros and cons.

1. Sorting Solution

The simplest approach is to use sorting:

Steps: First, count the frequency of each element using a dictionary or Python's Counter class. Then, convert the frequency dictionary to a list of (element, frequency) tuples and sort this list by frequency in descending order. Finally, select the first k elements from the sorted list.
Time Complexity: O(n log n), due to sorting the entire list of elements by frequency.
Space Complexity: O(n) for the frequency dictionary and sorted list.
Pros:
- Straightforward and easy to implement.
- Suitable for small to moderate input sizes.
Cons:
- Sorting is inefficient for large lists when we only need the top k elements. Sorting all elements doesn’t leverage the partial results we need.

2. Heap Solution (Efficient for Larger Lists with Small `k`)

A more efficient approach for larger inputs is to use a min-heap:

Steps: After creating a frequency dictionary, we use a min-heap to keep track of the k most frequent elements. For each element and its frequency, we push it into the heap. If the heap size exceeds k, we remove the element with the smallest frequency, ensuring that only the k most frequent elements remain.
Time Complexity: O(n log k), where we add n elements to a heap of size k.
Space Complexity: O(n) for the dictionary and O(k) for the heap.
Pros:
- Efficient for large inputs where k is small relative to n.
- Uses less space by storing only k elements in the heap.
Cons:
- More complex to implement due to heap operations.

3. Bucket Sort Solution (Optimal for Frequency-Based Grouping)

An even more efficient approach in terms of time complexity is bucket sort:

Steps: After building the frequency dictionary, we create an array of buckets where each index represents a frequency count. Each bucket stores elements that appear that many times. Finally, we collect the top k elements by iterating through the buckets from highest to lowest frequency.
Time Complexity: O(n), as we only count elements and place them into frequency-based buckets.
Space Complexity: O(n) for the dictionary and buckets.
Pros:
- Highly efficient for large inputs and avoids sorting or heap maintenance.
- Works well for situations where k is close to n.
Cons:
- Bucket sort can be less intuitive to implement, and requires extra space for the buckets.

4. Quickselect Solution (Optimal for Top-k Selection)

Another highly efficient solution, especially for very large lists, is Quickselect:

Steps: Quickselect is a partition-based algorithm similar to quicksort. After building the frequency dictionary, we convert it into a list of (element, frequency) pairs and use Quickselect to partially sort the list such that the k most frequent elements are positioned in the first k spots. We partition the list until the k-th most frequent element is in the correct position, and return the first k elements.
Time Complexity: O(n) on average, as Quickselect only partially sorts the list to find the top k elements.
Space Complexity: O(n) for the dictionary and list.
Pros:
- Very efficient with an average-case complexity of O(n), especially for very large lists.
- Avoids sorting the entire list, which makes it faster than the sorting approach.
Cons:
- The worst-case complexity is O(n^2), though using random pivot selection helps mitigate this risk.
- Quickselect is more complex to implement and understand compared to other solutions.

Summary:

Sorting: Simple but less efficient for large inputs, with O(n log n) complexity.
Heap: Ideal for large lists when k is much smaller than n, with O(n log k) complexity.
Bucket Sort: Optimized for large lists and frequency-based grouping, with O(n) complexity, though it requires additional space.
Quickselect: Offers the best average-case efficiency with O(n) time complexity, ideal for very large lists and when k is close to n.

Each solution has its trade-offs, so I’d choose the approach based on input size and constraints. For large lists with small k, the heap or Quickselect approach would be optimal, while for lists where k is close to n, bucket sort may be best."

This script provides a structured breakdown of each solution, explaining when each approach is optimal based on the constraints, making it easy to decide the best solution.

Problem Statement Recap:
You are given a list of integers and an integer k. The goal is to return the k most frequent elements in the list.

1. Sorting Solution

Explanation:

Start by counting the frequency of each element in the list. We can use Python's Counter from the collections module to achieve this.
Once we have the frequency of each element, we convert the frequency dictionary into a list of tuples, where each tuple is (element, frequency).
Sort this list of tuples in descending order based on frequency.
Finally, select the first k elements from this sorted list.

Implementation:

from collections import Counter
from typing import List

def top_k_frequent_sort(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Step 2: Sort items by frequency in descending order
    sorted_items = sorted(freq_count.items(), key=lambda item: item[1], reverse=True)
    # Step 3: Extract the first k elements
    return [item[0] for item in sorted_items[:k]]

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_sort(nums, k))  # Output: [1, 2]

Time Complexity: O(n log n)

Counting frequencies takes O(n), while sorting the items by frequency takes O(n log n).

Space Complexity: O(n) for storing the frequency dictionary and sorted list.

Pros:

Simple and straightforward to implement.
Effective for small to medium inputs.

Cons:

Sorting the entire frequency list is unnecessary when we only need the top k elements, making it less efficient for large inputs.

2. Heap Solution (Optimal for Large Lists with Small `k`)

Explanation:

After counting the frequency of each element, we use a min-heap of size k to keep track of the k most frequent elements.
We push each element along with its frequency into the heap.
- If the heap exceeds size k, we remove the element with the smallest frequency (root of the min-heap).
By the end, the heap contains only the k most frequent elements.

Implementation:

import heapq
from collections import Counter
from typing import List

def top_k_frequent_heap(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Step 2: Use a min-heap of size k
    min_heap = []
    for num, freq in freq_count.items():
        heapq.heappush(min_heap, (freq, num))
        if len(min_heap) > k:
            heapq.heappop(min_heap)
    # Step 3: Extract elements from the heap
    return [num for (freq, num) in min_heap]

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_heap(nums, k))  # Output: [2, 1]

Time Complexity: O(n log k)

Counting frequencies takes O(n), and maintaining a heap of size k takes O(log k) time for each of the n elements.

Space Complexity: O(n) for the frequency dictionary and O(k) for the heap.

Pros:

Efficient for large inputs when k is much smaller than n.
Keeps track of only k elements, optimizing space and time usage.

Cons:

Slightly more complex to implement due to heap management.

3. Bucket Sort Solution (Optimal for Frequency-Based Grouping)

Explanation:

This approach leverages the fact that the frequency of elements is bounded by the length of the list (n), as the maximum frequency an element can have is n.
Create an array of n+1 empty buckets (index 0 to n). Each bucket at index i holds a list of elements that appear i times.
Place each element from the frequency dictionary into its corresponding bucket based on frequency.
Finally, iterate through the buckets in reverse order, collecting elements until we have k elements.

Implementation:

from collections import Counter
from typing import List

def top_k_frequent_bucket_sort(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Step 2: Initialize buckets where index is frequency
    buckets = [[] for _ in range(len(nums) + 1)]
    for num, freq in freq_count.items():
        buckets[freq].append(num)
    # Step 3: Gather top k elements from the buckets
    result = []
    for i in range(len(buckets) - 1, 0, -1):
        for num in buckets[i]:
            result.append(num)
            if len(result) == k:
                return result

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_bucket_sort(nums, k))  # Output: [1, 2]

Time Complexity: O(n)

Counting frequencies takes O(n), and placing elements into buckets also takes O(n).

Space Complexity: O(n) for the frequency dictionary and buckets.

Pros:

Very efficient for problems where n is large and k is close to n.
No sorting or heap maintenance required, and it handles frequencies directly.

Cons:

Bucket sort can be less intuitive to implement.
Requires extra space for buckets, which may not be ideal for space-constrained environments.

Certainly! Here’s the Quickselect implementation for solving the Top K Frequent Elements problem.

4. Quickselect Solution

Explanation:

We start by building a frequency dictionary to count the occurrences of each element.
Then, we convert this dictionary into a list of tuples (element, frequency).
Using Quickselect, we partition the list of tuples so that the k most frequent elements are positioned in the first k spots in the list.
After partitioning, we return the elements from the first k positions.

Implementation:

import random
from collections import Counter
from typing import List, Tuple

def top_k_frequent_quickselect(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Convert the dictionary to a list of (element, frequency) pairs
    freq_items = list(freq_count.items())

    def partition(left: int, right: int, pivot_index: int) -> int:
        pivot_frequency = freq_items[pivot_index][1]
        # Move pivot to the end
        freq_items[pivot_index], freq_items[right] = freq_items[right], freq_items[pivot_index]
        store_index = left
        # Move all elements with frequency greater than pivot to the left
        for i in range(left, right):
            if freq_items[i][1] > pivot_frequency:
                freq_items[store_index], freq_items[i] = freq_items[i], freq_items[store_index]
                store_index += 1
        # Move pivot to its final place
        freq_items[right], freq_items[store_index] = freq_items[store_index], freq_items[right]
        return store_index

    def quickselect(left: int, right: int, k_smallest: int):
        if left == right:  # If the list contains only one element
            return
        # Select a random pivot index
        pivot_index = random.randint(left, right)
        # Partition the array around the pivot
        pivot_index = partition(left, right, pivot_index)
        # Recursively apply quickselect on the relevant half
        if k_smallest == pivot_index:
            return
        elif k_smallest < pivot_index:
            quickselect(left, pivot_index - 1, k_smallest)
        else:
            quickselect(pivot_index + 1, right, k_smallest)

    # Perform Quickselect for the k most frequent elements
    n = len(freq_items)
    quickselect(0, n - 1, k - 1)

    # Return the first k elements' values from freq_items
    return [item[0] for item in freq_items[:k]]

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_quickselect(nums, k))  # Output: [1, 2]

Explanation of Key Steps:

Partition Function:
- Selects a pivot and rearranges elements such that all elements with frequency higher than the pivot are on the left, and all elements with frequency lower than the pivot are on the right.
- This allows us to position elements based on their frequency.
Quickselect Function:
- Partitions the list around a pivot index until the k-th most frequent element is in the correct position.
- This process allows us to get the top k frequent elements in average O(n) time without fully sorting the list.

Pros and Cons:

Pros: Efficient with an average time complexity of O(n), ideal for large lists.
Cons: The worst-case time complexity is O(n^2), though random pivot selection mitigates this in practice.

Summary

Sorting Solution: Simple but inefficient for large n, with O(n log n) complexity.
Heap Solution: Ideal for large n with small k, with O(n log k) complexity.
Bucket Sort Solution: Optimal for large n and frequency-based grouping, with O(n) complexity, but uses more space.

'ML Engineering > python' 카테고리의 다른 글

🚀 Dynamic Sliding Window Technique (0)	2025.03.19
[Sort] Merge Sort (0)	2025.01.11
Heap/Quickselect \| K Closest Points to Origin (0)	2024.10.26
Heap/Quickselect\| Finds the k-th smallest/largest element(s) in the list (0)	2024.10.26
08. Math (0)	2024.08.07

Notes