https://leetcode.com/problems/top-k-frequent-elements/description/?envType=company&envId=facebook&favoriteSlug=facebook-thirty-days

 

Here’s an updated structured script that includes the Quickselect approach for solving the Top K Frequent Elements problem, along with sorting, heap, and bucket sort methods.


You:
"To solve the problem of finding the k most frequent elements in a list of integers, there are a few different approaches we can take, depending on the input size and the desired level of efficiency. I’ll walk through each approach, from simplest to most optimized, including their pros and cons.

1. Sorting Solution

The simplest approach is to use sorting:

  • Steps: First, count the frequency of each element using a dictionary or Python's Counter class. Then, convert the frequency dictionary to a list of (element, frequency) tuples and sort this list by frequency in descending order. Finally, select the first k elements from the sorted list.
  • Time Complexity: O(n log n), due to sorting the entire list of elements by frequency.
  • Space Complexity: O(n) for the frequency dictionary and sorted list.
  • Pros:
    • Straightforward and easy to implement.
    • Suitable for small to moderate input sizes.
  • Cons:
    • Sorting is inefficient for large lists when we only need the top k elements. Sorting all elements doesn’t leverage the partial results we need.

2. Heap Solution (Efficient for Larger Lists with Small k)

A more efficient approach for larger inputs is to use a min-heap:

  • Steps: After creating a frequency dictionary, we use a min-heap to keep track of the k most frequent elements. For each element and its frequency, we push it into the heap. If the heap size exceeds k, we remove the element with the smallest frequency, ensuring that only the k most frequent elements remain.
  • Time Complexity: O(n log k), where we add n elements to a heap of size k.
  • Space Complexity: O(n) for the dictionary and O(k) for the heap.
  • Pros:
    • Efficient for large inputs where k is small relative to n.
    • Uses less space by storing only k elements in the heap.
  • Cons:
    • More complex to implement due to heap operations.

3. Bucket Sort Solution (Optimal for Frequency-Based Grouping)

An even more efficient approach in terms of time complexity is bucket sort:

  • Steps: After building the frequency dictionary, we create an array of buckets where each index represents a frequency count. Each bucket stores elements that appear that many times. Finally, we collect the top k elements by iterating through the buckets from highest to lowest frequency.
  • Time Complexity: O(n), as we only count elements and place them into frequency-based buckets.
  • Space Complexity: O(n) for the dictionary and buckets.
  • Pros:
    • Highly efficient for large inputs and avoids sorting or heap maintenance.
    • Works well for situations where k is close to n.
  • Cons:
    • Bucket sort can be less intuitive to implement, and requires extra space for the buckets.

4. Quickselect Solution (Optimal for Top-k Selection)

Another highly efficient solution, especially for very large lists, is Quickselect:

  • Steps: Quickselect is a partition-based algorithm similar to quicksort. After building the frequency dictionary, we convert it into a list of (element, frequency) pairs and use Quickselect to partially sort the list such that the k most frequent elements are positioned in the first k spots. We partition the list until the k-th most frequent element is in the correct position, and return the first k elements.
  • Time Complexity: O(n) on average, as Quickselect only partially sorts the list to find the top k elements.
  • Space Complexity: O(n) for the dictionary and list.
  • Pros:
    • Very efficient with an average-case complexity of O(n), especially for very large lists.
    • Avoids sorting the entire list, which makes it faster than the sorting approach.
  • Cons:
    • The worst-case complexity is O(n^2), though using random pivot selection helps mitigate this risk.
    • Quickselect is more complex to implement and understand compared to other solutions.

Summary:

  • Sorting: Simple but less efficient for large inputs, with O(n log n) complexity.
  • Heap: Ideal for large lists when k is much smaller than n, with O(n log k) complexity.
  • Bucket Sort: Optimized for large lists and frequency-based grouping, with O(n) complexity, though it requires additional space.
  • Quickselect: Offers the best average-case efficiency with O(n) time complexity, ideal for very large lists and when k is close to n.

Each solution has its trade-offs, so I’d choose the approach based on input size and constraints. For large lists with small k, the heap or Quickselect approach would be optimal, while for lists where k is close to n, bucket sort may be best."


This script provides a structured breakdown of each solution, explaining when each approach is optimal based on the constraints, making it easy to decide the best solution.


Problem Statement Recap:
You are given a list of integers and an integer k. The goal is to return the k most frequent elements in the list.


1. Sorting Solution

Explanation:

  • Start by counting the frequency of each element in the list. We can use Python's Counter from the collections module to achieve this.
  • Once we have the frequency of each element, we convert the frequency dictionary into a list of tuples, where each tuple is (element, frequency).
  • Sort this list of tuples in descending order based on frequency.
  • Finally, select the first k elements from this sorted list.

Implementation:

from collections import Counter
from typing import List

def top_k_frequent_sort(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Step 2: Sort items by frequency in descending order
    sorted_items = sorted(freq_count.items(), key=lambda item: item[1], reverse=True)
    # Step 3: Extract the first k elements
    return [item[0] for item in sorted_items[:k]]

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_sort(nums, k))  # Output: [1, 2]

Time Complexity: O(n log n)

  • Counting frequencies takes O(n), while sorting the items by frequency takes O(n log n).

Space Complexity: O(n) for storing the frequency dictionary and sorted list.

Pros:

  • Simple and straightforward to implement.
  • Effective for small to medium inputs.

Cons:

  • Sorting the entire frequency list is unnecessary when we only need the top k elements, making it less efficient for large inputs.

2. Heap Solution (Optimal for Large Lists with Small k)

Explanation:

  • After counting the frequency of each element, we use a min-heap of size k to keep track of the k most frequent elements.
  • We push each element along with its frequency into the heap.
    • If the heap exceeds size k, we remove the element with the smallest frequency (root of the min-heap).
  • By the end, the heap contains only the k most frequent elements.

Implementation:

import heapq
from collections import Counter
from typing import List

def top_k_frequent_heap(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Step 2: Use a min-heap of size k
    min_heap = []
    for num, freq in freq_count.items():
        heapq.heappush(min_heap, (freq, num))
        if len(min_heap) > k:
            heapq.heappop(min_heap)
    # Step 3: Extract elements from the heap
    return [num for (freq, num) in min_heap]

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_heap(nums, k))  # Output: [2, 1]

Time Complexity: O(n log k)

  • Counting frequencies takes O(n), and maintaining a heap of size k takes O(log k) time for each of the n elements.

Space Complexity: O(n) for the frequency dictionary and O(k) for the heap.

Pros:

  • Efficient for large inputs when k is much smaller than n.
  • Keeps track of only k elements, optimizing space and time usage.

Cons:

  • Slightly more complex to implement due to heap management.

3. Bucket Sort Solution (Optimal for Frequency-Based Grouping)

Explanation:

  • This approach leverages the fact that the frequency of elements is bounded by the length of the list (n), as the maximum frequency an element can have is n.
  • Create an array of n+1 empty buckets (index 0 to n). Each bucket at index i holds a list of elements that appear i times.
  • Place each element from the frequency dictionary into its corresponding bucket based on frequency.
  • Finally, iterate through the buckets in reverse order, collecting elements until we have k elements.

Implementation:

from collections import Counter
from typing import List

def top_k_frequent_bucket_sort(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Step 2: Initialize buckets where index is frequency
    buckets = [[] for _ in range(len(nums) + 1)]
    for num, freq in freq_count.items():
        buckets[freq].append(num)
    # Step 3: Gather top k elements from the buckets (highest frequency first)
    result = []
    for i in range(len(buckets) - 1, 0, -1):
        for num in buckets[i]:
            result.append(num)
            if len(result) == k:
                return result
    return result  # Fallback: returned if fewer than k distinct elements exist

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_bucket_sort(nums, k))  # Output: [1, 2]

Time Complexity: O(n)

  • Counting frequencies takes O(n), and placing elements into buckets also takes O(n).

Space Complexity: O(n) for the frequency dictionary and buckets.

Pros:

  • Very efficient for problems where n is large and k is close to n.
  • No sorting or heap maintenance required, and it handles frequencies directly.

Cons:

  • Bucket sort can be less intuitive to implement.
  • Requires extra space for buckets, which may not be ideal for space-constrained environments.

Here’s the Quickselect implementation for solving the Top K Frequent Elements problem.

4. Quickselect Solution

Explanation:

  • We start by building a frequency dictionary to count the occurrences of each element.
  • Then, we convert this dictionary into a list of tuples (element, frequency).
  • Using Quickselect, we partition the list of tuples so that the k most frequent elements are positioned in the first k spots in the list.
  • After partitioning, we return the elements from the first k positions.

Implementation:

import random
from collections import Counter
from typing import List, Tuple

def top_k_frequent_quickselect(nums: List[int], k: int) -> List[int]:
    # Step 1: Count frequencies
    freq_count = Counter(nums)
    # Convert the dictionary to a list of (element, frequency) pairs
    freq_items = list(freq_count.items())

    def partition(left: int, right: int, pivot_index: int) -> int:
        pivot_frequency = freq_items[pivot_index][1]
        # Move pivot to the end
        freq_items[pivot_index], freq_items[right] = freq_items[right], freq_items[pivot_index]
        store_index = left
        # Move all elements with frequency greater than pivot to the left
        for i in range(left, right):
            if freq_items[i][1] > pivot_frequency:
                freq_items[store_index], freq_items[i] = freq_items[i], freq_items[store_index]
                store_index += 1
        # Move pivot to its final place
        freq_items[right], freq_items[store_index] = freq_items[store_index], freq_items[right]
        return store_index

    def quickselect(left: int, right: int, k_smallest: int):
        if left == right:  # If the list contains only one element
            return
        # Select a random pivot index
        pivot_index = random.randint(left, right)
        # Partition the array around the pivot
        pivot_index = partition(left, right, pivot_index)
        # Recursively apply quickselect on the relevant half
        if k_smallest == pivot_index:
            return
        elif k_smallest < pivot_index:
            quickselect(left, pivot_index - 1, k_smallest)
        else:
            quickselect(pivot_index + 1, right, k_smallest)

    # Perform Quickselect for the k most frequent elements
    n = len(freq_items)
    quickselect(0, n - 1, k - 1)

    # Return the first k elements' values from freq_items
    return [item[0] for item in freq_items[:k]]

# Example usage
nums = [1, 1, 1, 2, 2, 3]
k = 2
print(top_k_frequent_quickselect(nums, k))  # Output: [1, 2]

Explanation of Key Steps:

  1. Partition Function:
    • Selects a pivot and rearranges elements such that all elements with frequency higher than the pivot are on the left, and all elements with frequency lower than the pivot are on the right.
    • This allows us to position elements based on their frequency.
  2. Quickselect Function:
    • Partitions the list around a pivot index until the k-th most frequent element is in the correct position.
    • This process allows us to get the top k frequent elements in average O(n) time without fully sorting the list.

Pros and Cons:

  • Pros: Efficient with an average time complexity of O(n), ideal for large lists.
  • Cons: The worst-case time complexity is O(n^2), though random pivot selection mitigates this in practice.

Summary

  • Sorting Solution: Simple but inefficient for large n, with O(n log n) complexity.
  • Heap Solution: Ideal for large n with small k, with O(n log k) complexity.
  • Bucket Sort Solution: Optimal for large n and frequency-based grouping, with O(n) complexity, but uses more space.
  • Quickselect Solution: Best average-case efficiency with O(n) time complexity, though the worst case is O(n^2).

Here's a structured script for explaining how to solve the K Closest Points to Origin problem with various approaches, including brute-force, heap, and optimized techniques, along with pros and cons:


You:
"To solve the problem of finding the k closest points to the origin in a list of points, we can explore multiple approaches depending on the input size and desired efficiency. The goal is to calculate the Euclidean distance for each point from the origin (0, 0) and return the k points with the smallest distances.

  1. Brute-Force Approach (Sorting)
    The simplest approach is to calculate the distance of each point from the origin, store these distances in a list, and then sort the list based on these distances. We can then select the first k points from this sorted list.
    • Calculate the Euclidean distance squared for each point to avoid using sqrt (not necessary for comparing distances).
    • Sort the list of points based on the calculated distances.
    • Return the first k points from the sorted list.
    Pros:
    • Straightforward and easy to implement.
    Cons:
    • Sorting the list takes O(n log n) time, which can be inefficient for large n if only k smallest distances are required.
    • This approach doesn’t leverage the fact that we only need the closest k points, not the full sorted list.
  2. Heap Approach (Optimized for Efficiency)
    To improve efficiency, we can use a max-heap to store the k closest points while traversing the points list:
    • We first add the first k points to the max-heap based on their distances to the origin.
    • For each additional point, calculate its distance and compare it with the root of the max-heap (the largest distance among the k closest points so far). If the new point’s distance is smaller, we remove the root and insert the new point into the heap.
    • This ensures that we only keep k points in the heap at any time.
    Pros:
    • O(n log k) time complexity, which is more efficient than sorting when k is much smaller than n.
    • The heap helps us efficiently maintain the k closest points without sorting the entire list.
    Cons:
    • Implementing and managing a max-heap might add complexity, but Python’s heapq library can simplify this.
  3. Quickselect Algorithm (Alternative Optimal Solution)
    Another approach is to use the Quickselect algorithm, which is similar to quicksort but only partially sorts the array to find the k smallest elements:
    • Select a pivot point and partition the array around this pivot based on distance.
    • After partitioning, if the pivot position is k, then the first k points in the list are the closest.
    • Otherwise, recursively apply quickselect on the relevant half of the array until the first k points are found.
    Pros:
    • O(n) average time complexity, which is faster than heap-based solutions for large inputs.
    • No additional memory overhead since quickselect works in-place.
    Cons:
    • The worst-case time complexity is O(n^2), though choosing a random pivot or median-of-medians strategy can mitigate this.
    • It’s more complex to implement than the heap approach.
  4. Alternative Approach (Using Sorted Containers for Smaller Inputs)
    For scenarios where the input list of points is relatively small, using Python’s built-in sorted function or even a data structure like a SortedList (from sortedcontainers library) can simplify implementation.
    • Sort the points based on the distance in O(n log n) time.
    • Return the first k points.
    This is typically less efficient than heap-based solutions for large n but can be simpler for small datasets.

To summarize:

  • For a basic solution, sorting works fine but is not efficient for large n.
  • For larger inputs, the max-heap approach is optimal for maintaining the closest k points in O(n log k) time.
  • For very large inputs, Quickselect provides an efficient in-place solution with average O(n) time complexity.

Given these options, I would choose the max-heap approach for balancing efficiency and ease of implementation, or Quickselect for the absolute best average time complexity."


Here are Python implementations for each solution to the K Closest Points to Origin problem: sorting, heap, and quickselect.

1. Sorting Solution

This approach calculates the distances and sorts the points by distance.

from typing import List

def k_closest_sort(points: List[List[int]], k: int) -> List[List[int]]:
    # Sort points by distance (no need for sqrt as we only need relative distance)
    points.sort(key=lambda point: point[0]**2 + point[1]**2)
    # Return the first k points
    return points[:k]

# Example usage
points = [[1, 3], [-2, 2], [5, 8], [0, 1]]
k = 2
print(k_closest_sort(points, k))  # Output: k closest points

2. Heap Solution

This approach uses a max-heap to keep track of the k closest points.

import heapq
from typing import List

def k_closest_heap(points: List[List[int]], k: int) -> List[List[int]]:
    # Initialize a max-heap with the first k points (negative distance for max-heap behavior)
    max_heap = [(-point[0]**2 - point[1]**2, point) for point in points[:k]]
    heapq.heapify(max_heap)

    # Iterate over the remaining points
    for point in points[k:]:
        distance = -point[0]**2 - point[1]**2  # Negative for max-heap
        # If current point is closer than the farthest in the heap
        if distance > max_heap[0][0]:
            heapq.heappop(max_heap)
            heapq.heappush(max_heap, (distance, point))

    # Extract the points from the heap
    return [point for (_, point) in max_heap]

# Example usage
points = [[1, 3], [-2, 2], [5, 8], [0, 1]]
k = 2
print(k_closest_heap(points, k))  # Output: k closest points

3. Quickselect Solution

This approach uses Quickselect to partition the points so that the k closest points are in the first k positions of the list.

import random
from typing import List

def k_closest_quickselect(points: List[List[int]], k: int) -> List[List[int]]:
    def distance(point):
        return point[0]**2 + point[1]**2

    def partition(left, right, pivot_index):
        pivot_distance = distance(points[pivot_index])
        # Move pivot to the end
        points[pivot_index], points[right] = points[right], points[pivot_index]
        store_index = left
        # Move all points closer than pivot to the left
        for i in range(left, right):
            if distance(points[i]) < pivot_distance:
                points[store_index], points[i] = points[i], points[store_index]
                store_index += 1
        # Move pivot to its final place
        points[right], points[store_index] = points[store_index], points[right]
        return store_index

    def quickselect(left, right, k_smallest):
        if left >= right:  # Zero or one element left (also guards the case k == len(points))
            return
        # Select a random pivot index
        pivot_index = random.randint(left, right)
        # Partition the array around the pivot
        pivot_index = partition(left, right, pivot_index)
        # Recursively apply the quickselect
        if k_smallest == pivot_index:
            return
        elif k_smallest < pivot_index:
            quickselect(left, pivot_index - 1, k_smallest)
        else:
            quickselect(pivot_index + 1, right, k_smallest)

    # Perform quickselect for the k closest points
    quickselect(0, len(points) - 1, k)
    # Return the first k elements
    return points[:k]

# Example usage
points = [[1, 3], [-2, 2], [5, 8], [0, 1]]
k = 2
print(k_closest_quickselect(points, k))  # Output: k closest points

Summary:

  • Sorting Solution: Simple but not optimal for large lists (O(n log n)).
  • Heap Solution: Efficient for large lists when k is small (O(n log k)).
  • Quickselect Solution: Most efficient average time complexity for large lists (O(n) on average).

Here’s an explanation for each solution to the K Closest Points to Origin problem:


1. Sorting Solution

Explanation:

  • In this approach, we start by calculating the Euclidean distance squared for each point from the origin. We square the distances (point[0]**2 + point[1]**2) instead of taking the square root to avoid unnecessary computations and because the relative order of distances remains unchanged without square roots.
  • We sort the list of points based on these squared distances in ascending order.
  • Finally, we return the first k points from the sorted list as they represent the closest points.

Time Complexity: O(n log n)

  • Sorting the list of n points is the main operation and takes O(n log n) time.

Space Complexity: O(1) (assuming in-place sorting)

Pros:

  • Simple to implement and straightforward to understand.
  • Works well for relatively small input sizes.

Cons:

  • Sorting the entire list of points is inefficient for large inputs when we only need k closest points.
  • Doesn’t take advantage of the fact that we only need a partial result (k points) and could be optimized further for larger inputs.

2. Heap Solution (Optimal for Larger Inputs)

Explanation:

  • This approach uses a max-heap to store the k closest points. A max-heap allows us to efficiently maintain the k smallest distances at any given time.
  • We start by adding the first k points from points into the max-heap. We store each point in the heap with its negative squared distance, as Python’s heapq module only supports a min-heap, and we need a max-heap behavior to keep the farthest element at the root.
  • For each remaining point, we calculate its squared distance. If this new distance is smaller than the largest distance in the max-heap (i.e., the current farthest point), we replace the root with the new point, keeping only the k closest points in the heap.
  • After processing all points, the heap contains exactly the k closest points.

Time Complexity: O(n log k)

  • Constructing and maintaining a max-heap with only k elements takes O(log k) time per insertion.
  • For n points, this results in O(n log k) time, which is much more efficient than sorting for large n when k is small.

Space Complexity: O(k)

  • The heap stores only k points at any time, which is efficient in terms of space.

Pros:

  • Efficient for large inputs where k is much smaller than n, as we don’t need to sort all points.
  • Uses only O(k) extra space for the heap, making it memory-efficient.

Cons:

  • Slightly more complex to implement, particularly due to the max-heap setup using Python’s min-heap functionality.

3. Quickselect Solution (Optimal Average-Case Time Complexity)

Explanation:

  • This approach uses the Quickselect algorithm, which is similar to quicksort but focuses only on partitioning the list until we have the k smallest elements at the start of the list.
  • We first select a pivot point randomly and partition the points based on their distance from the origin. All points with distances less than the pivot’s distance are moved to the left, and all with greater distances to the right.
  • After partitioning, if the pivot position is k, the first k elements are the closest points, and we return them.
  • If the pivot position is less than k, we apply quickselect recursively to the right side, as we need more points to reach k. If the pivot position is greater than k, we apply quickselect on the left side.
  • This process continues until we find the first k closest points.

Time Complexity: O(n) on average

  • Each partition step reduces the number of elements we’re working with, giving an average-case time complexity of O(n).
  • However, in the worst case (e.g., when the pivot is always poorly chosen), this algorithm can degrade to O(n^2).

Space Complexity: O(1) extra space for the in-place partitioning, plus the recursion stack, which is O(log n) deep on average (O(n) in the worst case).

Pros:

  • The most efficient average-case solution for large inputs, particularly when k is close to n.
  • In-place operations mean no additional space is needed beyond the input list.

Cons:

  • The worst-case time complexity is O(n^2), though using a random pivot helps avoid this in practice.
  • More complex to implement than heap or sorting solutions.

Summary of Approach Suitability

  • Sorting: Best for smaller lists or if you need a simple solution without concern for optimal efficiency.
  • Heap: Ideal for large lists where k is small compared to n, balancing ease of implementation with optimal performance.
  • Quickselect: The most efficient solution for very large lists when you want the lowest average time complexity, but it requires careful handling to avoid worst-case scenarios.

Each approach provides a useful solution depending on the input size and constraints, showing flexibility in problem-solving.


https://leetcode.com/problems/kth-largest-element-in-an-array/?envType=company&envId=facebook&favoriteSlug=facebook-thirty-days

 

"To approach this problem, I like to first consider the brute-force method because it helps clarify how the solution works and when it might become inefficient based on the input size.

For instance, let's say the problem is to find the 'k' largest elements in an array. If the input size 'n' is relatively small—let's say around 100 elements or fewer—I might start by sorting the array. Sorting takes O(n log n) time, so with 100 elements, that would be approximately 700 operations, which is manageable. Once sorted, I can simply take the last 'k' elements, which gives the correct answer with minimal complexity.

However, if the input size grows significantly—say, 1 million elements—sorting the entire array now takes about 20 million operations (O(n log n) with n = 1,000,000). At this scale, sorting is inefficient, especially if we only need a few elements, like the top 10. In such cases, it’s better to optimize the solution.

That's where heaps come into play. I would use a min-heap to store only 'k' elements. As I traverse the array, I compare each element to the smallest element in the heap (in O(1) time) and, if it's larger, I replace it (in O(log k) time). By the end of the traversal, the heap contains the 'k' largest elements. The time complexity for this approach is O(n log k), which, for large inputs, drastically reduces the number of operations. For example, with 1 million elements and 'k' equal to 10, this approach takes roughly n · log k ≈ 1,000,000 × 3.3 ≈ 3.3 million operations, which is still far less than the ~20 million required to sort.

In summary:

  1. If the input size is small, I start with a brute-force approach because sorting is simple and sufficient for manageable input sizes.
  2. But if the input size is large, like in the range of hundreds of thousands or millions, I switch to a heap-based approach, which efficiently handles the problem in logarithmic time relative to 'k'."

You:
"After initially considering a brute-force solution, depending on the input size, I would next evaluate the more optimal approaches. The heap approach is one of the most commonly used because it allows us to efficiently maintain the 'k' largest or smallest elements.

For example, if we’re looking for the 'k' largest elements, I would use a min-heap of size 'k'. As I traverse the array, I compare each element to the root of the heap:

  • If the element is larger than the heap’s root (the smallest element in the heap), I replace the root with the new element and re-heapify. This takes O(log k) time for each operation, resulting in a total time complexity of O(n log k) for 'n' elements. This is efficient when 'k' is small compared to 'n', and avoids the inefficiencies of sorting the entire array.

However, there are also other approaches depending on the problem constraints:

  1. Quickselect Algorithm
    If I need the exact 'k-th' largest element, and I don’t necessarily need the elements in sorted order, I might consider the quickselect algorithm. Quickselect is a partition-based algorithm, similar to quicksort, but it only partially sorts the array around the 'k-th' element. This method works in O(n) on average and is efficient for finding one element (or a range) without fully sorting the array. It’s particularly useful when we’re concerned only with the 'k-th' element, rather than all the 'k' largest.
  2. Bucket Sort
    For specific scenarios where the input range is limited (e.g., when elements are integers within a known small range), I could use bucket sort. This approach involves dividing the array into buckets and distributing the elements into these buckets based on their values. After filling the buckets, we can gather the top 'k' elements by traversing the highest buckets. This method can achieve O(n) time if the range of values is small relative to 'n'. It’s a great option when the problem constraints allow, especially if I know the data distribution is skewed or limited.

To summarize:

  • I start with a brute-force approach for small input sizes or as a baseline.
  • I then move to a heap-based solution for larger input sizes where I need efficient handling of the 'k' largest or smallest elements.
  • If I’m only interested in the 'k-th' element, I might use quickselect for a more direct approach with average O(n) complexity.
  • In specific cases, like when the value range is known and limited, bucket sort might be the best option, offering linear time complexity for certain distributions."
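
To make the heap-based strategy described in the script concrete, here is a minimal sketch that keeps a min-heap of size 'k' while scanning the array. The function name k_largest_min_heap is my own choice for illustration, not part of the problem statement:

import heapq
from typing import List

def k_largest_min_heap(nums: List[int], k: int) -> List[int]:
    # Maintain a min-heap of at most k elements; the root is always
    # the smallest of the k largest values seen so far.
    min_heap: List[int] = []
    for num in nums:
        if len(min_heap) < k:
            heapq.heappush(min_heap, num)
        elif min_heap and num > min_heap[0]:
            # New element beats the current k-th largest: replace the root.
            heapq.heapreplace(min_heap, num)
    # The heap now holds the k largest elements; its root is the k-th largest.
    return sorted(min_heap, reverse=True)

# Example usage
print(k_largest_min_heap([3, 2, 1, 5, 6, 4], 2))  # [6, 5]; the k-th largest is 5

heapq.heapreplace pops the root and pushes the new value in a single O(log k) operation, which is exactly the compare-and-replace step described above.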

You:
"Quickselect is an efficient algorithm for finding the 'k-th' smallest or largest element in an unsorted list. It's based on the same idea as quicksort, but instead of sorting the entire list, we focus only on the part of the list that contains the element we're looking for. This reduces the average time complexity to O(n), making it much faster than sorting, which takes O(n log n).

Let me walk through the approach step-by-step:

  1. Initial Setup
    We begin by picking a pivot element from the list. This can be any element, but in practice, choosing the middle element or a random element often works well to avoid worst-case scenarios.
  2. Partitioning
    Similar to quicksort, we partition the array around the pivot:
    • Elements smaller than the pivot are placed on the left.
    • Elements greater than the pivot are placed on the right.
    The key here is that, after partitioning, the pivot is in its correct position relative to the rest of the array—everything on the left is smaller, and everything on the right is larger.
  3. Decide Which Half to Search
    After partitioning, we check the position of the pivot:
    • If the pivot is in the 'k-th' position (for 'k-th' smallest), we have found our element, and we can return it.
    • If the pivot’s position is greater than 'k', we know the 'k-th' smallest element lies in the left subarray, so we recursively apply the same logic to that subarray.
    • If the pivot’s position is less than 'k', we search the right subarray for the 'k-th' smallest element.
    This recursive process continues until we find the desired element.
  4. Time Complexity
    In the average case, each partition step reduces the problem size by half, resulting in a time complexity of O(n). However, in the worst case (if the pivot is poorly chosen), the algorithm can degrade to O(n^2), similar to quicksort. But with a good pivot selection strategy (like choosing a random pivot), we can generally avoid this scenario.

Example:
Let’s say we’re looking for the 3rd smallest element in the array [7, 2, 1, 6, 8, 5, 3, 4]. Using quickselect:

  1. Pick a pivot—let's choose 4.
  2. Partition the array: [2, 1, 3] | 4 | [7, 6, 8, 5]. Now 4 is in its correct position (index 3).
  3. Since we’re looking for the 3rd smallest element, which is still in the left partition [2, 1, 3], we recursively apply quickselect to that subarray.
  4. Eventually, we find that 3 is the 3rd smallest element.

To summarize:

  • Quickselect is highly efficient for finding the 'k-th' smallest or largest element with an average time complexity of O(n).
  • Instead of sorting the entire array, it partitions the array recursively, focusing only on the necessary part.
  • It's a great alternative when we only need a specific element, rather than sorting everything."

import random

def quickselect(arr, k):
    """
    Finds the k-th smallest element in the list `arr`.

    Parameters:
    - arr (list of int): The list of integers.
    - k (int): The index (1-based) of the element to find in sorted order.

    Returns:
    - int: The k-th smallest element.
    """
    # Convert k to 0-based index for simplicity
    k = k - 1

    def partition(left, right, pivot_index):
        pivot_value = arr[pivot_index]
        # Move pivot to the end
        arr[pivot_index], arr[right] = arr[right], arr[pivot_index]
        store_index = left
        # Move all smaller elements to the left
        for i in range(left, right):
            if arr[i] < pivot_value:
                arr[i], arr[store_index] = arr[store_index], arr[i]
                store_index += 1
        # Move pivot to its final place
        arr[right], arr[store_index] = arr[store_index], arr[right]
        return store_index

    def select(left, right):
        # Base case: the list contains only one element
        if left == right:
            return arr[left]

        # Select a random pivot index
        pivot_index = random.randint(left, right)

        # Perform partitioning
        pivot_index = partition(left, right, pivot_index)

        # The pivot is in its final sorted position
        if pivot_index == k:
            return arr[pivot_index]
        elif pivot_index < k:
            # Go right
            return select(pivot_index + 1, right)
        else:
            # Go left
            return select(left, pivot_index - 1)

    return select(0, len(arr) - 1)

Example Usage:

arr = [7, 2, 1, 6, 8, 5, 3, 4]
k = 3
print(f"The {k}-th smallest element is {quickselect(arr, k)}.")

Explanation:

  • partition: This function arranges elements around a pivot so that all smaller elements are to the left and all larger elements are to the right.
  • select: This recursive function narrows down the search based on the position of the pivot.
  • Time Complexity: On average, Quickselect takes O(n) time due to halving the search space with each partition.


Neural Machine Translation with Byte-Level Subwords

- https://arxiv.org/pdf/1909.03341

- https://git.opendfki.de/yhamidullah/fairseq-stable/-/tree/noencoder/examples/byte_level_bpe

 

 

Byte Level Text Representation

 

Encoding Byte-Level Representation. We consider UTF8 encoding of text, which encodes each Unicode character into 1 to 4 bytes. This allows us to model a sentence as a sequence of bytes instead of characters. While there are 138K Unicode characters covering over 150 languages, we represent a sentence in any language as a sequence of UTF-8 bytes (248 out of 256 possible bytes).

 

A byte sequence representation of text is often much longer (up to 4x) than a character sequence representation, which makes it computationally demanding to use bytes as they are. As an alternative, we consider segmenting a byte sequence into variable-length n-grams (byte-level “subwords”). Specifically, we learn BPE vocabulary on the byte-level representation which extends UTF-8 byte set with byte n-grams. We denote this type of vocabulary as B(ytelevel)BPE in the rest of the paper. Figure 1 shows an example of BBPE tokenization.

 

BBPE symbols can be partial characters shared by different characters or the combination of complete and partial characters. This arbitrariness may necessitate incorporating a larger context surrounding each symbol for disambiguation and learning the character boundaries. In this work, we base our experiments on Transformer (Vaswani et al. 2017) models. We propose to use either a depth-wise convolutional layer (Kaiser, Gomez, and Chollet 2017) or a bidirectional recurrent layer with gated recurrent units (GRU; Cho et al. 2014) to contextualize BBPE embeddings before feeding them into the model.

 

Decoding with Byte-Level Subwords. While any sentence can be represented as a byte sequence, the converse is, however, not necessarily true in that there are byte sequences that do not translate to valid character sequences. Empirically, we find that invalid outputs from trained models are very rare. We do not observe any in the experiments described below (note that one of them does have a large test set of 165K examples). And a common error pattern in half-trained models is redundant repeating bytes. In our system, we try to recover as many Unicode characters as possible from this error pattern efficiently in linear time. The algorithm is as follows: for a given byte sequence {B_k}, k = 1..N, we denote the maximum number of characters that we can recover from its first k bytes as f(k). Then f(k) has optimal substructure and can be solved by dynamic programming:

f(k) = max_{t ∈ {1,2,3,4}} [ f(k − t) + g(k − t + 1, k) ]    (1)

where g(i, j) = 1 if the byte span B_i, ..., B_j corresponds to a valid character, otherwise 0. When f(k) is calculated recursively, we also record the selections at each position k so that we can recover the solution through backtracking. The design of UTF-8 encoding ensures the uniqueness of this recovery process: for a character UTF-8 encoded with multiple bytes, its trailing bytes will not make a valid UTF-8 encoded character. Then the best selection in Eq. 1 is unique and so is the final solution.
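
Here is a minimal Python sketch of the dynamic program quoted above, under the assumption that the indicator g(i, j) can be evaluated by simply attempting a UTF-8 decode of the span; the function name recover_characters and the tie-breaking rule are my own choices, not from the paper:

def recover_characters(byte_seq: bytes) -> str:
    """Recover as many Unicode characters as possible from a possibly
    invalid UTF-8 byte sequence, following the DP described above."""
    n = len(byte_seq)
    f = [0] * (n + 1)       # f[k]: max characters recoverable from the first k bytes
    best_t = [1] * (n + 1)  # span length chosen at position k (for backtracking)

    def g(i: int, j: int) -> int:
        # 1 if byte_seq[i:j] decodes to exactly one valid character, else 0.
        try:
            return 1 if len(byte_seq[i:j].decode("utf-8")) == 1 else 0
        except UnicodeDecodeError:
            return 0

    for k in range(1, n + 1):
        for t in range(1, min(4, k) + 1):  # UTF-8 characters are 1 to 4 bytes long
            candidate = f[k - t] + g(k - t, k)
            if candidate > f[k]:
                f[k], best_t[k] = candidate, t

    # Backtrack, collecting the valid character spans in order.
    chars, k = [], n
    while k > 0:
        t = best_t[k]
        if g(k - t, k):
            chars.append(byte_seq[k - t:k].decode("utf-8"))
        k -= t
    return "".join(reversed(chars))

# Example: one valid character followed by a stray continuation byte.
print(recover_characters("안".encode("utf-8") + b"\x95"))  # -> 안

The backtracking step recovers f(N) characters in total; invalid spans contribute nothing and are simply skipped, which mirrors the repair behaviour described in the paper.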

BILINGUAL END-TO-END ASR WITH BYTE-LEVEL SUBWORDS, Apple 2022

- https://arxiv.org/pdf/2205.00485

 

 

Byte-level models have been proposed for natural language processing (NLP) [9] [10] [11]. The idea is to convert text to a sequence of variable-length UTF-8 codewords, and to have the model predict one byte at each decoding step. The advantages of byte-level representation are compactness and universality, as any combination of languages may be represented with an output dimension of only 256. However, a sequence represented at the byte level is always much longer than its character-level counterpart for languages such as Chinese and Japanese [12], which is because many characters of these languages are represented by multiple bytes in UTF-8. As a result, a byte-level model can be error-prone since it needs to make multiple predictions for many single characters, and each prediction has a chance to make a mistake. To compensate for this drawback, [12] proposes byte-level subwords for neural machine translation. The idea is to apply byte pair encoding (BPE) [13] to UTF-8 codeword sequences, resulting in an approach referred to as byte-level BPE (BBPE). BBPE inherits the advantages of UTF-8 byte-level representation. BBPE is able to represent all languages while keeping the output dimension in check. At the same time, as BBPE tokens are in general longer than byte-level tokens, the approach reduces the number of steps required by the decoding process.

 

In this work, we investigate bilingual (English and Mandarin) E2E ASR models by exploring different types of output representations, including character-level, BPE, byte-level (UTF-8) and BBPE. Similar to some of the previous work cited, we build a single E2E model for utterance-based bilingual speech recognition. Our contributions are threefold. First, we compare the strengths and weaknesses of different output representations in monolingual and bilingual use cases. Second, we propose a method to adjust the bigram statistics in the BPE algorithm and show that the BBPE representation leads to accuracy improvements in the bilingual scenario. Finally, we analyze different representations and show how we might improve them for multilingual ASR.

 

OUTPUT REPRESENTATIONS FOR E2E ASR

 

Character-level Representation

 

Using a character-level representation in an E2E model means that the output symbol set for the model is the set of graphemes of the target language. In addition to graphemes, the output representation may also contain punctuation marks, digits, emojis or special tokens such as begin-of-sentence (BOS) or end-of-sentence (EOS). According to [14] [15], character-level representation is often a good representation for Mandarin E2E models, and this serves as one of the baselines in our experiments.

 

BPE Representation

 

The BPE algorithm [13] starts from the character representation and iteratively merges the most frequent bigrams given a training text corpus. At the end of this process, the BPE algorithm produces a symbol set that consists of subwords with different lengths. This symbol set can then be used by an E2E model as its output units. It is common to keep the single characters in the final symbol set, so unseen words in the test set can still be represented by the symbol set. For English, BPE is widely used in E2E ASR systems, as it improves accuracy and reduces computation due to the use of frequent subwords and the resulting shorter labeling sequences.

 

Byte-level Representation

 

Scalability is one of the important aspects in designing an output representation for a multilingual E2E ASR model. As the model supports more languages, the size of the symbol set increases. To tackle this problem, [8] proposes a byte-level representation based on UTF-8. Instead of using characters or subwords as the symbols, a byte-level model uses UTF-8 codewords as the output symbol set. The resulting representation is compact, as each UTF-8 codeword only has 256 possible values, so each symbol uses one byte. Yet, this representation is capable of representing any language, and adding more languages does not increase the size of the symbol set, which is an advantage compared to the character-level and BPE representations. However, byte-level representation has two drawbacks: first, it increases the length of the sequence by up to 4x [12], which increases the number of decoding steps during inference; second, not all byte sequences are valid UTF-8 sequences, which means byte-level models may generate invalid byte sequences that require special handling.

 

To repair an invalid byte sequence, [8] proposes a dynamic programming algorithm to recover the Unicode characters given any byte sequence. We use this post-processing approach to recover characters from byte sequences as much as possible.

 

Byte-level BPE Representation

 

To circumvent the increase of sequence length for byte-level representation, [12] proposes byte-level BPE (BBPE) for neural machine translation, which applies BPE to byte-represented text. The advantage of this approach is that it reduces the sequence length by adopting frequent byte-level subwords and it keeps the size of the symbol set in check. It is important to note that BBPE is equivalent to BPE for many Latin-based languages, since in UTF-8, all Latin characters are single byte units. However, for languages like Chinese or Japanese, characters can use multiple bytes, so BBPE could be helpful. Similar to BPE representation, BBPE representation might generate invalid byte sequences, and post-processing using dynamic programming is necessary to remedy that. Another aspect is that if we keep all the single-byte UTF-8 codewords in the symbol set after BPE, BBPE can represent all languages, as with the byte-level representation.

 

Reference

 

[1] Anjuli Kannan, Arindrima Datta, Tara Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, and Zhifeng Chen, “Large-scale multilingual speech recognition with a streaming end-to-end model,” in Proceedings of the INTERSPEECH, 2019. 

[2] Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Sri Garimella, et al., “Streaming end-to-end bilingual ASR systems with joint language identification,” arXiv preprint arXiv:2007.03900, 2020. 

[3] Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert, “Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters,” pp. 4751–4755, 2020. 

[4] Ke Li, Jinyu Li, Guoli Ye, Rui Zhao, and Yifan Gong, “Towards code-switching ASR for end-to-end CTC models,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019. 

[5] Changhao Shan, Chao Weng, Guangsen Wang, Dan Su, Min Luo, Dong Yu, and Lei Xie, “Investigating end-to-end speech recognition for mandarin-english code-switching,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019. 

[6] Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, and William M. Campbell, “Towards context-aware end-to-end code-switching speech recognition,” in Proceedings of the INTERSPEECH, 2020.

[8] Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, and William Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 5621–5625. 

[9] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya, “Multilingual language processing from bytes,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies, 2016, pp. 1296–1306. 

[10] Marta Ruiz Costa-Jussà, Carlos Escolano Peinado, and José Adrián Rodríguez Fonollosa, “Byte-based neural machine translation,” in Proceedings of the First Workshop on Subword and Character Level Models in NLP, 2017, pp. 154–158. 

[11] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel, “ByT5: Towards a token-free future with pre-trained byte-to-byte models,” 2021. 

[12] Changhan Wang, Kyunghyun Cho, and Jiatao Gu, “Neural machine translation with byte-level subwords,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 9154–9160.

OPTIMIZING BYTE-LEVEL REPRESENTATION FOR END-TO-END ASR, Apple 2024

Introduction

End-to-end (E2E) neural networks are flexible and accurate models for multilingual automatic speech recognition (ASR). The output of such a multilingual model is often unions of characters or subwords of the supported languages. However, as the number of languages increases, the size of the output layer increases, which can negatively affect compute, memory usage and asset size. This problem is more prominent when the system supports languages that have large character sets, such as Chinese, Japanese and Korean (CJK). To tackle this problem, previous work proposed the use of byte level representation for E2E ASR [1, 2]. By using UTF-8 [3] codewords as the underlying base tokens, the output vocabulary is no longer constrained by the character sets of each language, allowing developers to choose a vocabulary size based on compute, and memory constraints. One well-known multilingual ASR system that uses UTF-8 subwords is Whisper [4].

UTF-8 aims to represent all the characters used in major languages. The encoding and decoding processes are designed to be simple and efficient. UTF-8 is a variable length prefix code where each character is represented by one to four bytes. Most byte sequences are not valid UTF-8 strings, and the UTF-8 decoder needs to detect invalid sequences. UTF-8 also provides backward compatibility, where ASCII characters are represented by a single byte and they are the same as the ASCII encoding. While UTF-8 has proven to be an effective output representation for ASR, it is unclear whether it is optimal. For example, characters with similar pronunciations or meaning are not guaranteed to share the same prefixes. In addition, the large number of invalid byte sequences means the model needs to identify valid UTF-8 strings, an additional burden.

 

UTF-8 BASED REPRESENTATION

UTF-8 based models have been proposed for natural language processing (NLP) [5] [6] [7]. The idea is to convert text to a sequence of variable-length UTF-8 codewords, and to have the model predict one byte at each decoding step. The advantages of byte-level representation are compactness and universality, as any combination of languages may be represented with an output dimension of only 256. However, a sequence represented at byte level is often longer than its character-level counterpart, especially for CJK languages [8]. This is because while Latin characters are represented by a single byte, many CJK characters and accented characters are represented by multiple bytes. As a result, a byte-level model can be error-prone since it needs to make multiple predictions for many single characters, and each prediction might make a mistake.

To compensate for the drawback of making byte level mistakes, [1, 2] propose byte-level subwords for E2E ASR. The idea is to apply byte pair encoding (BPE) [9] to UTF-8 codeword sequences to create UTF-8 subwords. As subwords are in general longer than byte-level tokens, this approach reduces the number of steps required by the decoding process. However, BPE does not guarantee that the output will be a valid UTF-8 sequence. To repair an invalid byte sequence, [1] proposes a dynamic programming algorithm to recover as many characters as possible given any byte sequence. While this dynamic programming approach ensures the output sequence is always valid, it optimizes for the number of valid characters, not ASR quality.

 

Reference

[1] Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, and William Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 5621–5625.

[2] L. Deng, R. Hsiao, and A. Ghoshal, “Bilingual end-to-end ASR with byte-level subwords,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022.

[8] Changhan Wang, Kyunghyun Cho, and Jiatao Gu, “Neural machine translation with byte-level subwords,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 9154–9160.

[9] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1715–1725.

"안녕, teacher 의성"이라는 문장을 UTF-8 인코딩byte-level BPE를 활용해 처리하는 과정을 설명해 볼게요.

1. UTF-8 인코딩 과정

UTF-8은 유니코드 문자를 바이트(byte)로 변환하는 방식입니다. 예를 들어, "안녕, teacher 의성" 문장을 UTF-8로 변환하면 각 글자가 바이트 시퀀스로 표현됩니다.

  • "안": EC 95 88
  • "녕": EB 85 95
  • ",": 2C (ASCII 문자)
  • " ": 20 (공백)
  • "t": 74
  • "e": 65
  • "a": 61
  • "c": 63
  • "h": 68
  • "e": 65
  • "r": 72
  • " ": 20 (공백)
  • "의": EC 9D 98
  • "성": EC 84 B1

이렇게 UTF-8 인코딩을 사용하면 각 문자와 기호가 바이트 시퀀스로 변환됩니다.
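
As a quick check of the byte listing above, here is a small snippet (assuming Python 3.8+ for the separator argument to bytes.hex):

# Verify the UTF-8 byte sequence of the example sentence.
text = "안녕, teacher 의성"
encoded = text.encode("utf-8")
print(len(text), "characters ->", len(encoded), "bytes")  # 14 characters -> 22 bytes
print(encoded.hex(" ").upper())  # EC 95 88 EB 85 95 2C 20 74 65 61 63 68 65 72 20 EC 9D 98 EC 84 B1
for ch in text:
    print(repr(ch), ch.encode("utf-8").hex(" ").upper())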

2. Byte-Level BPE

Byte-level BPE tokenizes this byte sequence by repeatedly merging the most frequent byte pairs. The process works as follows (a toy sketch of the merge loop appears after this list):

  1. Initial tokenization:
    • Each character is converted to its UTF-8 bytes, and the sequence is split into individual bytes.
    • For example, "안녕, teacher 의성" becomes:
      [EC, 95, 88, EB, 85, 95, 2C, 20, 74, 65, 61, 63, 68, 65, 72, 20, EC, 9D, 98, EC, 84, B1]
  2. Counting byte-pair frequencies:
    • The frequency of every adjacent byte pair is counted, e.g. how often EC 95 or 95 88 occurs.
    • If, say, EC 95 is the most frequent pair, it is merged into a single new token.
  3. Merging pairs:
    • The most frequent pair is merged into a new token.
    • For example, once EC 95 is merged, it is stored as the new token EC95, and every occurrence in the sequence is replaced with it.
    • This step is repeated, always merging the currently most frequent pair, until the desired vocabulary size is reached.
  4. Final tokenization result:
    • Frequently occurring words such as "안녕" can end up merged into a single token.
    • In the end, byte-level BPE produces a compact token list built from the most frequent byte pairs.
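
The merge loop above can be illustrated with a small from-scratch sketch. This is a toy illustration of the mechanics only, not the actual BPE trainer used in the papers; on a single short sentence most pairs occur just once, so ties are broken by first occurrence:

from collections import Counter

def most_frequent_pair(seq):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])  # concatenated token, e.g. "EC95"
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from individual UTF-8 bytes, written as hex strings for readability.
tokens = [f"{b:02X}" for b in "안녕, teacher 의성".encode("utf-8")]
for _ in range(3):  # a few merge iterations
    pair = most_frequent_pair(tokens)
    if pair is None:
        break
    tokens = merge_pair(tokens, pair)
    print("merged", pair, "->", tokens)

In a real setting the merges are learned over a large corpus, so the most frequent pairs reflect genuine statistics rather than first occurrence.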

Summary

  • The sentence "안녕, teacher 의성" is converted into a byte sequence by UTF-8 encoding.
  • Byte-level BPE builds tokens from this byte sequence by merging frequently co-occurring byte pairs.
  • This process is especially useful for handling diverse characters and special symbols, and it allows different languages to be processed in a consistent way.

EC 95 88 is a byte sequence in UTF-8 encoding: each Unicode character is written as the hexadecimal values of the bytes that encode it.

Details

  • UTF-8 encodes Unicode characters with a variable length of 1 to 4 bytes.
  • EC 95 88 is a hexadecimal byte sequence used to represent a single Hangul character in UTF-8.
  • Specifically, EC 95 88 corresponds to the Unicode code point U+C548, which is the Hangul syllable "안".

How to interpret it

  1. Find the Unicode value:
    • Converting the UTF-8 bytes EC 95 88 back to a code point gives U+C548.
  2. Unicode mapping:
    • U+C548 denotes the Hangul syllable "안".
  3. UTF-8 byte sequence:
    • UTF-8 encodes the code point U+C548 as the three-byte sequence EC 95 88.

Summary

  • EC 95 88 is the UTF-8 byte sequence that represents the Hangul syllable "안".
  • The format is a hexadecimal representation, used when Unicode characters such as Hangul are encoded in UTF-8.
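
In Python, the mapping between the code point U+C548, the character "안", and the bytes EC 95 88 can be checked directly:

print(hex(ord("안")))                   # 0xc548  (code point U+C548)
print("안".encode("utf-8").hex(" "))    # ec 95 88
print(b"\xec\x95\x88".decode("utf-8"))  # 안
print(chr(0xC548))                      # 안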

The number 256 comes from the byte: it is the number of values an 8-bit unit can take. In the context of UTF-8 encoding and byte-level processing, 256 matters for the following reasons:

1. Bytes and 256

  • A byte is a unit of data made up of 8 bits.
  • One byte can take values from 0 to 255.
    • It can represent 256 distinct values because 8 bits allow 2^8 = 256 combinations.

2. UTF-8 and bytes

  • UTF-8 encodes each Unicode character with 1 to 4 bytes.
    • For example, English letters and other ASCII characters are represented with a single byte (0–127).
    • Characters such as Hangul need 2 or more bytes, and each of those bytes takes a value between 0 and 255.

3. Byte-level BPE and 256

  • When byte-level BPE tokenizes text at the byte level, each byte takes a value in the range 0–255.
  • This means every possible byte combination can be handled, and the 256 individual byte values can express combinations of arbitrary Unicode characters.

Summary

  • 256 is the number of values a byte can represent: 0 through 255.
  • In UTF-8, each byte has a value between 0 and 255, and characters are built from these bytes.
  • Byte-level BPE uses these 256 byte values as the basic units of tokenization, from which Unicode characters and their combinations are assembled.

The theoretical number of characters UTF-8 can represent follows from the design of Unicode. Unicode itself was designed to cover all of the world's writing systems, and UTF-8 encodes those Unicode characters as bytes. The number of characters UTF-8 can theoretically represent breaks down as follows:

1. The structure of UTF-8

UTF-8 uses a variable-length encoding of 1 to 4 bytes. Each byte follows a specific bit pattern, and the pattern depends on the range of the Unicode code point.

  • 1 byte (7 payload bits): 0x00–0x7F (identical to ASCII, 128 values)
  • 2 bytes (11 payload bits): 0x0080–0x07FF (about 2,048 values)
  • 3 bytes (16 payload bits): 0x0800–0xFFFF (about 65,536 values)
  • 4 bytes (21 payload bits): 0x010000–0x10FFFF (about 1,048,576 values)

2. The Unicode code point range

  • Unicode code points are defined from U+0000 to U+10FFFF.
  • This range contains about 1,114,112 code points, but the number of usable characters is smaller.

3. The number of characters actually representable

  • Not every code point denotes a character. Some code points are reserved for control characters or special purposes and are not used as characters directly.
  • The current Unicode standard (Unicode 15.0) defines roughly 149,186 characters, covering a wide range of scripts and symbols.

4. Theoretical capacity of UTF-8

  • In theory, UTF-8 can represent the entire Unicode range from 0x0000 to 0x10FFFF, i.e. about 1,114,112 code points.
  • In practice, the number of characters defined by the Unicode standard is smaller, and control characters and other reserved code points must also be taken into account.

Summary

  • Theoretically, UTF-8 can represent about 1,114,112 code points.
  • The characters currently defined in Unicode number roughly 149,186, and the standard keeps growing.
  • UTF-8 is a flexible scheme that can encode every Unicode code point as a byte sequence.
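
A small snippet illustrates the 1-to-4-byte ranges described above; the sample characters are arbitrary picks from each range:

# UTF-8 uses 1 to 4 bytes depending on the code point range.
samples = ["A", "ß", "안", "😀"]  # U+0041, U+00DF, U+C548, U+1F600
for ch in samples:
    cp = ord(ch)
    print(f"U+{cp:04X} {ch!r}: {len(ch.encode('utf-8'))} byte(s)")
# U+0041 'A': 1 byte(s), U+00DF 'ß': 2 byte(s), U+C548 '안': 3 byte(s), U+1F600 '😀': 4 byte(s)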

Notation such as 0x0080 expresses a number in hexadecimal. It is widely used in computer science and programming, especially for memory addresses, Unicode code points, and byte data. Here is a closer look at expressions like 0x0080.

1. Hexadecimal notation

  • Hexadecimal is a base-16 number system that uses the digits 0–9 and the letters A–F, so a single digit can take 16 different values.
  • For example:
    • 0x0 = 0 (decimal)
    • 0xA = 10 (decimal)
    • 0xF = 15 (decimal)
    • 0x10 = 16 (decimal)
  • The 0x prefix marks a value as hexadecimal. So 0x0080 is the hexadecimal number 80, which is 128 in decimal.

2. What 0x0080 means

  • 0x0080 is the number 128 written in hexadecimal.
    • The leading 00 is just zero padding; the significant part is 80.
    • Converted to decimal, this is 128.
  • Values like this appear frequently when describing bytes in memory, Unicode code points, color codes, and memory addresses.

3. Unicode and hexadecimal

  • Unicode code points are conventionally written in hexadecimal.
    • For example, the Unicode character U+0080 is the 128th code point of the standard.
    • In UTF-8, the character at code point 0x0080 is encoded with more than one byte.
  • As a code point, 0x0080 lies just past the ASCII range: ASCII covers 0x0000 through 0x007F (0–127) with a single byte, while values from 0x0080 (128) upward require 2 or more bytes.

4. Why hexadecimal is convenient

  • Memory addresses: hexadecimal makes memory addresses and byte data easier to read and to reason about.
  • Machine-friendly: hardware processes data in binary (0s and 1s); hexadecimal is a compact, human-readable view of it, since one hexadecimal digit corresponds to exactly 4 bits.
    • Example: 0x10 (hexadecimal) is 0001 0000 in binary.

Summary

  • 0x0080 is hexadecimal notation, equal to 128 in decimal.
  • Hexadecimal notation is useful for memory addresses, byte values, and Unicode code points, and is used throughout computer science.
  • In Unicode, 0x0080 lies just beyond the ASCII range, and encoding it in UTF-8 requires more than one byte.
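
The points above can be checked directly in Python, where the 0x prefix denotes a hexadecimal literal:

print(0x0080)           # 128
print(hex(128))         # 0x80
print(int("0080", 16))  # 128
# Code points below 0x0080 fit in one UTF-8 byte; 0x0080 and above need two or more.
print(len(chr(0x7F).encode("utf-8")), len(chr(0x80).encode("utf-8")))  # 1 2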

Converting binary to hexadecimal is fairly simple. Binary uses only the digits 0 and 1, while hexadecimal uses 0–9 and A–F. Here is the method, followed by worked examples.

1. How to convert binary to hexadecimal

  • To convert binary to hexadecimal, group the bits in chunks of 4 and convert each chunk to one hexadecimal digit.
  • Since one hexadecimal digit corresponds to exactly 4 bits, splitting the binary number into 4-bit groups makes the conversion straightforward.

2. Conversion steps

  1. Group into 4-bit chunks:
    • Group the binary digits in fours, starting from the right. If the number of bits is not a multiple of 4, pad with 0s on the left.
  2. Convert each 4-bit group to a hexadecimal digit:
    • Each 4-bit group maps to hexadecimal as follows:

Binary  Hex
0000    0
0001    1
0010    2
0011    3
0100    4
0101    5
0110    6
0111    7
1000    8
1001    9
1010    A
1011    B
1100    C
1101    D
1110    E
1111    F

  3. Replace each group with its hexadecimal digit and concatenate the results.

3. Examples

Example 1: convert binary 11011011 to hexadecimal

  1. Group the bits in fours:
    1101 1011
  2. Convert each group:
    • 1101 → D
    • 1011 → B
  3. So the hexadecimal representation of 11011011 is 0xDB.

Example 2: convert binary 101010 to hexadecimal

  1. Group the bits in fours. There are only 6 bits, so pad with 00 on the left to reach a multiple of 4:
    0010 1010
  2. Convert each group:
    • 0010 → 2
    • 1010 → A
  3. So the hexadecimal representation of 101010 is 0x2A.

4. Summary

  • To convert binary to hexadecimal, group the bits in fours and convert each group.
  • Four bits correspond to one hexadecimal digit, which makes the conversion efficient.
  • Examples: 11011011 → 0xDB, 101010 → 0x2A.

This procedure makes it easy to convert between binary and hexadecimal.
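
The two worked examples translate directly into Python's binary and hexadecimal literals:

print(hex(0b11011011))             # 0xdb
print(hex(0b101010))               # 0x2a
print(format(0b11011011, "X"))     # DB
print(bin(0xDB))                   # 0b11011011
print(int("11011011", 2) == 0xDB)  # True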

Overview:

  • The DynamicBatchSampler groups data samples into batches.
  • It takes into account the length of each sample, a target batch size, and a maximum token limit per batch.
  • A batch is generated when either the batch size or the sum of sample lengths reaches the specified limits.

Example Code:

import random

class DynamicBatchSampler:
    def __init__(self, data_lengths, batch_size, max_tokens):
        """
        Initializes the DynamicBatchSampler with data lengths, target batch size, and max tokens constraint.

        :param data_lengths: List of lengths of each data sample.
        :param batch_size: The base size of each batch.
        :param max_tokens: The maximum number of tokens (or length sum) allowed per batch.
        """
        self.data_lengths = data_lengths
        self.batch_size = batch_size
        self.max_tokens = max_tokens

    def __iter__(self):
        # Shuffle the indices to get a different sampling order each time.
        indices = list(range(len(self.data_lengths)))
        random.shuffle(indices)

        batch = []
        total_length = 0

        for idx in indices:
            length = self.data_lengths[idx]
            # Check if adding this sample would exceed the limits; the `batch`
            # guard avoids yielding an empty batch when a single sample is
            # longer than max_tokens.
            if batch and (total_length + length > self.max_tokens or len(batch) >= self.batch_size):
                # Yield the current batch and reset for the next one.
                yield batch
                batch = []
                total_length = 0

            # Add the current sample to the batch.
            batch.append(idx)
            total_length += length

        # Yield any remaining samples as the last batch.
        if batch:
            yield batch

# Example usage:
data_lengths = [5, 8, 3, 7, 10, 2, 4, 6]  # Example lengths of samples.
batch_size = 3  # Maximum number of samples per batch.
max_tokens = 15  # Maximum total length of samples in a batch.

sampler = DynamicBatchSampler(data_lengths, batch_size, max_tokens)

# Iterate over the batches generated by the DynamicBatchSampler.
for batch in sampler:
    print(f"Batch indices: {batch}, Batch lengths: {[data_lengths[i] for i in batch]}")

Explanation:

  • Data Lengths: [5, 8, 3, 7, 10, 2, 4, 6] represents the lengths of different samples.
  • Batch Size: 3 means each batch can have up to 3 samples.
  • Max Tokens: 15 restricts each batch to a maximum total length of 15.

Example output (the exact batches vary between runs because the indices are shuffled):

Batch indices: [6, 7], Batch lengths: [4, 6]
Batch indices: [4, 0], Batch lengths: [10, 5]
Batch indices: [1, 2, 5], Batch lengths: [8, 3, 2]
Batch indices: [3], Batch lengths: [7]

How It Works:

  • The sampler iterates over the data indices and groups them into batches.
  • A batch is finalized when adding another sample would exceed max_tokens or batch_size.
  • The example shows how batches are formed dynamically based on the length constraints, making it flexible for varying data sizes.
