Top K Elements Technique
The "Top K Elements" technique involves finding the largest (or smallest) ( k ) elements from a dataset. This is a common problem in computer science with applications in data analysis, search engines, recommendation systems, and more.
Key Concepts
- Priority Queue (Min-Heap or Max-Heap): A heap data structure that allows efficient extraction of the minimum or maximum element.
- Quickselect Algorithm: A selection algorithm to find the ( k )th largest (or smallest) element in an unordered list.
- Sorting: Sorting the entire dataset and then selecting the top ( k ) elements.
- Bucket Sort or Counting Sort: Special techniques for integer data with a limited range.
Methods to Find Top K Elements
- Heap (Priority Queue) Method: Efficient for dynamic data and large datasets.
- Quickselect Algorithm: Efficient for static data with good average-case performance.
- Sorting: Simple but less efficient for large datasets.
- Bucket Sort or Counting Sort: Efficient for specific types of data.
1. Heap (Priority Queue) Method
Using a min-heap (for the largest ( k ) elements) or max-heap (for the smallest ( k ) elements) to maintain the top ( k ) elements.
Code:
import heapq
def top_k_elements(nums, k):
# Use a min-heap for the largest k elements
if k == 0:
return []
min_heap = nums[:k]
heapq.heapify(min_heap)
for num in nums[k:]:
if num > min_heap[0]:
heapq.heappushpop(min_heap, num)
return min_heap
# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(top_k_elements(nums, k)) # Output: [5, 6]
Explanation:
- Initialize: Create a min-heap with the first ( k ) elements.
- Process: For each remaining element, if it is larger than the smallest element in the heap, replace the smallest element.
- Result: The heap contains the top ( k ) largest elements.
2. Quickselect Algorithm
Quickselect is a selection algorithm to find the ( k )th largest (or smallest) element in an unordered list, which can then be used to find the top ( k ) elements.
Code:
import random
def quickselect(nums, k):
def partition(left, right, pivot_index):
pivot_value = nums[pivot_index]
nums[pivot_index], nums[right] = nums[right], nums[pivot_index]
store_index = left
for i in range(left, right):
if nums[i] < pivot_value:
nums[store_index], nums[i] = nums[i], nums[store_index]
store_index += 1
nums[right], nums[store_index] = nums[store_index], nums[right]
return store_index
def select(left, right, k_smallest):
if left == right:
return nums[left]
pivot_index = random.randint(left, right)
pivot_index = partition(left, right, pivot_index)
if k_smallest == pivot_index:
return nums[k_smallest]
elif k_smallest < pivot_index:
return select(left, pivot_index - 1, k_smallest)
else:
return select(pivot_index + 1, right, k_smallest)
n = len(nums)
return select(0, n - 1, n - k)
# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(quickselect(nums, k)) # Output: 5
Explanation:
- Partition: Use the partitioning step from the QuickSort algorithm to position the pivot element correctly.
- Select: Recursively select the ( k )th smallest element, which corresponds to the ( (n - k) )th largest element.
- Result: The algorithm returns the ( k )th largest element, and the elements larger than it can be found in the list.
3. Sorting
Sorting the array and selecting the top ( k ) elements.
Code:
def top_k_elements_sort(nums, k):
nums.sort(reverse=True)
return nums[:k]
# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(top_k_elements_sort(nums, k)) # Output: [6, 5]
Explanation:
- Sort: Sort the array in descending order.
- Select: Take the first ( k ) elements from the sorted array.
- Result: These are the top ( k ) largest elements.
4. Bucket Sort or Counting Sort
For integer data with a limited range, bucket sort or counting sort can be efficient.
Code (Counting Sort for a limited range):
def top_k_elements_counting_sort(nums, k):
max_val = max(nums)
count = [0] * (max_val + 1)
for num in nums:
count[num] += 1
result = []
for num in range(max_val, -1, -1):
while count[num] > 0 and len(result) < k:
result.append(num)
count[num] -= 1
return result
# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(top_k_elements_counting_sort(nums, k)) # Output: [6, 5]
Explanation:
- Count Frequency: Count the frequency of each element.
- Select: Traverse the count array from the highest value, selecting elements until ( k ) elements are selected.
- Result: These are the top ( k ) largest elements.
If you want to find the largest K numbers and the smallest K numbers in an array simultaneously, you can use a combination of Min-Heap and Max-Heap. Here's a step-by-step approach in English:
Approach:
- Initialization:
- Create a Min-Heap to keep track of the largest K elements.
- Create a Max-Heap to keep track of the smallest K elements. (Since Python's
heapq
module only provides a Min-Heap, you can simulate a Max-Heap by pushing negative values.)
- Iterate through the array:
- For each element in the array:
- Add it to the Min-Heap if it's larger than the smallest element in the heap (the root). If the Min-Heap exceeds size K, remove the smallest element.
- Add it to the Max-Heap (with a negative sign) if it's smaller than the largest element (the root). If the Max-Heap exceeds size K, remove the largest element.
- For each element in the array:
- Results:
- After processing all elements, the Min-Heap contains the largest K elements, and the Max-Heap (with the negative signs removed) contains the smallest K elements.
This method ensures that you maintain both the largest and smallest K elements as you process the array, and the time complexity remains efficient.
Code:
import heapq
def find_largest_and_lowest_k_numbers(nums, k):
# Min-Heap for largest K numbers
min_heap = nums[:k]
heapq.heapify(min_heap)
# Max-Heap for smallest K numbers (using negative values)
max_heap = [-num for num in nums[:k]]
heapq.heapify(max_heap)
# Process the remaining elements in the array
for num in nums[k:]:
# Maintain the largest K numbers
if num > min_heap[0]:
heapq.heapreplace(min_heap, num)
# Maintain the smallest K numbers
if -num > max_heap[0]:
heapq.heapreplace(max_heap, -num)
# Convert the Max-Heap back to positive numbers
lowest_k = [-x for x in max_heap]
return min_heap, lowest_k
# Example usage
nums = [3, 2, 1, 5, 6, 4, 8, 7]
k = 3
largest_k, lowest_k = find_largest_and_lowest_k_numbers(nums, k)
print("Largest K:", largest_k) # Output: [6, 7, 8]
print("Lowest K:", lowest_k) # Output: [1, 2, 3]
215. Kth Largest Element in an Array
Problem: Given an integer array nums
and an integer k
, return the k
th largest element in the array.
Approach:
- Min-Heap: Use a min-heap to keep track of the largest
k
elements in the array. The top element of the heap will be thek
th largest element. - Quickselect: Another approach is to use the Quickselect algorithm, which is based on the partition method used in QuickSort. This method is efficient with an average time complexity of (O(n)).
Min-Heap Code:
import heapq
def find_kth_largest(nums, k):
heap = nums[:k]
heapq.heapify(heap)
for num in nums[k:]:
if num > heap[0]:
heapq.heappushpop(heap, num)
return heap[0]
# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(find_kth_largest(nums, k)) # Output: 5
Quickselect Code:
def find_kth_largest(nums, k):
k = len(nums) - k
def quickselect(l, r):
pivot, p = nums[r], l
for i in range(l, r):
if nums[i] <= pivot:
nums[p], nums[i] = nums[i], nums[p]
p += 1
nums[p], nums[r] = nums[r], nums[p]
if p > k:
return quickselect(l, p - 1)
elif p < k:
return quickselect(p + 1, r)
else:
return nums[p]
return quickselect(0, len(nums) - 1)
# Example usage:
nums = [3, 2, 1, 5, 6, 4]
k = 2
print(find_kth_largest(nums, k)) # Output: 5
23. Merge k Sorted Lists
Problem: You are given an array of k
linked-lists, each linked-list is sorted in ascending order. Merge all the linked-lists into one sorted linked-list and return it.
Approach:
- Min-Heap: Use a min-heap to efficiently merge the
k
sorted linked lists. Each time, extract the smallest element from the heap and add it to the merged list.
Code:
import heapq
class ListNode:
def __init__(self, val=0, next=None):
self.val = val
self.next = next
def __lt__(self, other):
return self.val < other.val
def merge_k_lists(lists):
min_heap = []
for root in lists:
if root:
heapq.heappush(min_heap, root)
dummy = ListNode()
curr = dummy
while min_heap:
smallest_node = heapq.heappop(min_heap)
curr.next = smallest_node
curr = curr.next
if smallest_node.next:
heapq.heappush(min_heap, smallest_node.next)
return dummy.next
# Example usage:
# Assuming ListNode class is defined and linked lists are created.
lists = [list1, list2, list3] # Example linked-lists
result = merge_k_lists(lists)
Explanation:
- Min-Heap: Each list's head is added to the heap. The heap helps efficiently find and add the smallest element across all lists to the final merged list.
703. Kth Largest Element in a Stream
Problem: Design a class to find the k
th largest element in a stream. Implement the KthLargest
class:
KthLargest(int k, int[] nums)
Initializes the object with the integerk
and the stream of integersnums
.int add(int val)
Appends the integerval
to the stream and returns the element representing thek
th largest element in the stream.
Approach:
- Min-Heap: Maintain a min-heap of size
k
. The smallest element in the heap is thek
th largest element in the stream.
Code:
import heapq
class KthLargest:
def __init__(self, k, nums):
self.k = k
self.min_heap = nums
heapq.heapify(self.min_heap)
while len(self.min_heap) > k:
heapq.heappop(self.min_heap)
def add(self, val):
heapq.heappush(self.min_heap, val)
if len(self.min_heap) > self.k:
heapq.heappop(self.min_heap)
return self.min_heap[0]
# Example usage:
k = 3
nums = [4, 5, 8, 2]
kthLargest = KthLargest(k, nums)
print(kthLargest.add(3)) # Output: 4
print(kthLargest.add(5)) # Output: 5
print(kthLargest.add(10)) # Output: 5
print(kthLargest.add(9)) # Output: 8
print(kthLargest.add(4)) # Output: 8
Summary
- Heap (Priority Queue) Method:
- Pros: Efficient for dynamic data and large datasets.
- Cons: Slightly complex to implement.
- Quickselect Algorithm:
- Pros: Good average-case performance, efficient for static data.
- Cons: Worst-case performance can be poor.
- Sorting:
- Pros: Simple to implement.
- Cons: Less efficient for large datasets ((O(n \log n))).
- Bucket Sort or Counting Sort:
- Pros: Very efficient for integer data with a limited range.
- Cons: Not general-purpose, requires specific data characteristics.
Choosing the right method depends on the characteristics of the dataset and the specific requirements of the problem at hand.
'ML Engineering > python' 카테고리의 다른 글
08. Math (0) | 2024.08.07 |
---|---|
07. Subset Techniques (0) | 2024.08.06 |
05. Modified Binary Search Technique (0) | 2024.08.06 |
04. Binary Tree BFS Techniques (0) | 2024.08.06 |
03. Binary Tree DFS Techniques (0) | 2024.08.06 |