06. Top K Elements Technique

2024. 8. 6. 06:17

Top K Elements Technique

The "Top K Elements" technique involves finding the largest (or smallest) ( k ) elements from a dataset. This is a common problem in computer science with applications in data analysis, search engines, recommendation systems, and more.

Key Concepts

Priority Queue (Min-Heap or Max-Heap): A heap data structure that allows efficient extraction of the minimum or maximum element.
Quickselect Algorithm: A selection algorithm to find the ( k )th largest (or smallest) element in an unordered list.
Sorting: Sorting the entire dataset and then selecting the top ( k ) elements.
Bucket Sort or Counting Sort: Special techniques for integer data with a limited range.

Methods to Find Top K Elements

Heap (Priority Queue) Method: Efficient for dynamic data and large datasets.
Quickselect Algorithm: Efficient for static data with good average-case performance.
Sorting: Simple but less efficient for large datasets.
Bucket Sort or Counting Sort: Efficient for specific types of data.

1. Heap (Priority Queue) Method

Using a min-heap (for the largest ( k ) elements) or max-heap (for the smallest ( k ) elements) to maintain the top ( k ) elements.

Code:

import heapq

def top_k_elements(nums, k):
    # Use a min-heap for the largest k elements
    if k == 0:
        return []

    min_heap = nums[:k]
    heapq.heapify(min_heap)

    for num in nums[k:]:
        if num > min_heap[0]:
            heapq.heappushpop(min_heap, num)

    return min_heap

# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(top_k_elements(nums, k))  # Output: [5, 6]

Explanation:

Initialize: Create a min-heap with the first ( k ) elements.
Process: For each remaining element, if it is larger than the smallest element in the heap, replace the smallest element.
Result: The heap contains the top ( k ) largest elements.

2. Quickselect Algorithm

Quickselect is a selection algorithm to find the ( k )th largest (or smallest) element in an unordered list, which can then be used to find the top ( k ) elements.

Code:

import random

def quickselect(nums, k):
    def partition(left, right, pivot_index):
        pivot_value = nums[pivot_index]
        nums[pivot_index], nums[right] = nums[right], nums[pivot_index]
        store_index = left
        for i in range(left, right):
            if nums[i] < pivot_value:
                nums[store_index], nums[i] = nums[i], nums[store_index]
                store_index += 1
        nums[right], nums[store_index] = nums[store_index], nums[right]
        return store_index

    def select(left, right, k_smallest):
        if left == right:
            return nums[left]
        pivot_index = random.randint(left, right)
        pivot_index = partition(left, right, pivot_index)
        if k_smallest == pivot_index:
            return nums[k_smallest]
        elif k_smallest < pivot_index:
            return select(left, pivot_index - 1, k_smallest)
        else:
            return select(pivot_index + 1, right, k_smallest)

    n = len(nums)
    return select(0, n - 1, n - k)

# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(quickselect(nums, k))  # Output: 5

Explanation:

Partition: Use the partitioning step from the QuickSort algorithm to position the pivot element correctly.
Select: Recursively select the ( k )th smallest element, which corresponds to the ( (n - k) )th largest element.
Result: The algorithm returns the ( k )th largest element, and the elements larger than it can be found in the list.

3. Sorting

Sorting the array and selecting the top ( k ) elements.

Code:

def top_k_elements_sort(nums, k):
    nums.sort(reverse=True)
    return nums[:k]

# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(top_k_elements_sort(nums, k))  # Output: [6, 5]

Explanation:

Sort: Sort the array in descending order.
Select: Take the first ( k ) elements from the sorted array.
Result: These are the top ( k ) largest elements.

4. Bucket Sort or Counting Sort

For integer data with a limited range, bucket sort or counting sort can be efficient.

Code (Counting Sort for a limited range):

def top_k_elements_counting_sort(nums, k):
    max_val = max(nums)
    count = [0] * (max_val + 1)

    for num in nums:
        count[num] += 1

    result = []
    for num in range(max_val, -1, -1):
        while count[num] > 0 and len(result) < k:
            result.append(num)
            count[num] -= 1

    return result

# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(top_k_elements_counting_sort(nums, k))  # Output: [6, 5]

Explanation:

Count Frequency: Count the frequency of each element.
Select: Traverse the count array from the highest value, selecting elements until ( k ) elements are selected.
Result: These are the top ( k ) largest elements.

If you want to find the largest K numbers and the smallest K numbers in an array simultaneously, you can use a combination of Min-Heap and Max-Heap. Here's a step-by-step approach in English:

Approach:

Initialization:
- Create a Min-Heap to keep track of the largest K elements.
- Create a Max-Heap to keep track of the smallest K elements. (Since Python's heapq module only provides a Min-Heap, you can simulate a Max-Heap by pushing negative values.)
Iterate through the array:
- For each element in the array:
  - Add it to the Min-Heap if it's larger than the smallest element in the heap (the root). If the Min-Heap exceeds size K, remove the smallest element.
  - Add it to the Max-Heap (with a negative sign) if it's smaller than the largest element (the root). If the Max-Heap exceeds size K, remove the largest element.
Results:
- After processing all elements, the Min-Heap contains the largest K elements, and the Max-Heap (with the negative signs removed) contains the smallest K elements.

This method ensures that you maintain both the largest and smallest K elements as you process the array, and the time complexity remains efficient.

Code:

import heapq

def find_largest_and_lowest_k_numbers(nums, k):
    # Min-Heap for largest K numbers
    min_heap = nums[:k]
    heapq.heapify(min_heap)

    # Max-Heap for smallest K numbers (using negative values)
    max_heap = [-num for num in nums[:k]]
    heapq.heapify(max_heap)

    # Process the remaining elements in the array
    for num in nums[k:]:
        # Maintain the largest K numbers
        if num > min_heap[0]:
            heapq.heapreplace(min_heap, num)

        # Maintain the smallest K numbers
        if -num > max_heap[0]:
            heapq.heapreplace(max_heap, -num)

    # Convert the Max-Heap back to positive numbers
    lowest_k = [-x for x in max_heap]

    return min_heap, lowest_k

# Example usage
nums = [3, 2, 1, 5, 6, 4, 8, 7]
k = 3
largest_k, lowest_k = find_largest_and_lowest_k_numbers(nums, k)
print("Largest K:", largest_k)  # Output: [6, 7, 8]
print("Lowest K:", lowest_k)    # Output: [1, 2, 3]

215. Kth Largest Element in an Array

Problem: Given an integer array nums and an integer k, return the kth largest element in the array.

Approach:

Min-Heap: Use a min-heap to keep track of the largest k elements in the array. The top element of the heap will be the kth largest element.
Quickselect: Another approach is to use the Quickselect algorithm, which is based on the partition method used in QuickSort. This method is efficient with an average time complexity of (O(n)).

Min-Heap Code:

import heapq

def find_kth_largest(nums, k):
    heap = nums[:k]
    heapq.heapify(heap)
    for num in nums[k:]:
        if num > heap[0]:
            heapq.heappushpop(heap, num)
    return heap[0]

# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(find_kth_largest(nums, k))  # Output: 5

Quickselect Code:

def find_kth_largest(nums, k):
    k = len(nums) - k

    def quickselect(l, r):
        pivot, p = nums[r], l
        for i in range(l, r):
            if nums[i] <= pivot:
                nums[p], nums[i] = nums[i], nums[p]
                p += 1
        nums[p], nums[r] = nums[r], nums[p]
        if p > k:
            return quickselect(l, p - 1)
        elif p < k:
            return quickselect(p + 1, r)
        else:
            return nums[p]

    return quickselect(0, len(nums) - 1)

# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(find_kth_largest(nums, k))  # Output: 5

23. Merge k Sorted Lists

Problem: You are given an array of k linked-lists, each linked-list is sorted in ascending order. Merge all the linked-lists into one sorted linked-list and return it.

Approach:

Min-Heap: Use a min-heap to efficiently merge the k sorted linked lists. Each time, extract the smallest element from the heap and add it to the merged list.

Code:

import heapq

class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

    def __lt__(self, other):
        return self.val < other.val

def merge_k_lists(lists):
    min_heap = []
    for root in lists:
        if root:
            heapq.heappush(min_heap, root)

    dummy = ListNode()
    curr = dummy

    while min_heap:
        smallest_node = heapq.heappop(min_heap)
        curr.next = smallest_node
        curr = curr.next
        if smallest_node.next:
            heapq.heappush(min_heap, smallest_node.next)

    return dummy.next

# Example usage:
# Assuming ListNode class is defined and linked lists are created.
lists = [list1, list2, list3]  # Example linked-lists
result = merge_k_lists(lists)

Explanation:

Min-Heap: Each list's head is added to the heap. The heap helps efficiently find and add the smallest element across all lists to the final merged list.

703. Kth Largest Element in a Stream

Problem: Design a class to find the kth largest element in a stream. Implement the KthLargest class:

KthLargest(int k, int[] nums) Initializes the object with the integer k and the stream of integers nums.
int add(int val) Appends the integer val to the stream and returns the element representing the kth largest element in the stream.

Approach:

Min-Heap: Maintain a min-heap of size k. The smallest element in the heap is the kth largest element in the stream.

Code:

import heapq

class KthLargest:
    def __init__(self, k, nums):
        self.k = k
        self.min_heap = nums
        heapq.heapify(self.min_heap)
        while len(self.min_heap) > k:
            heapq.heappop(self.min_heap)

    def add(self, val):
        heapq.heappush(self.min_heap, val)
        if len(self.min_heap) > self.k:
            heapq.heappop(self.min_heap)
        return self.min_heap[0]

# Example usage:
k = 3
nums = [4, 5, 8, 2]
kthLargest = KthLargest(k, nums)
print(kthLargest.add(3))  # Output: 4
print(kthLargest.add(5))  # Output: 5
print(kthLargest.add(10))  # Output: 5
print(kthLargest.add(9))  # Output: 8
print(kthLargest.add(4))  # Output: 8

Summary

Heap (Priority Queue) Method:
- Pros: Efficient for dynamic data and large datasets.
- Cons: Slightly complex to implement.
Quickselect Algorithm:
- Pros: Good average-case performance, efficient for static data.
- Cons: Worst-case performance can be poor.
Sorting:
- Pros: Simple to implement.
- Cons: Less efficient for large datasets ((O(n \log n))).
Bucket Sort or Counting Sort:
- Pros: Very efficient for integer data with a limited range.
- Cons: Not general-purpose, requires specific data characteristics.

Choosing the right method depends on the characteristics of the dataset and the specific requirements of the problem at hand.

'ML Engineering > python' 카테고리의 다른 글

08. Math (0)	2024.08.07
07. Subset Techniques (0)	2024.08.06
05. Modified Binary Search Technique (0)	2024.08.06
04. Binary Tree BFS Techniques (0)	2024.08.06
03. Binary Tree DFS Techniques (0)	2024.08.06

Notes