ARM Assembly: Sorting

Last updated on Nov 22, 2019 16 min read Programming, ARM, Assembly, Algorithms

xkcd.com/1185

After taking a hiatus for two years, I've started working with ARM assembly language again. I realized that the code I had been working on before had become a kind of utility library, so I rearranged the git repository to reflect that.

While doing so, I noticed that my sorting libraries were in an incomplete state, so I decided to work on finishing them. The result is that I now have four working sort functions that all operate in place on an existing array of 32bit signed integers.

Below you'll find more details about each of the sorting algorithms I implemented, along with the code. All of the following code snippets are licensed under the GPLv3.

Bubble Sort

Bubble sort is often one of the first sorting algorithms people learn. It works by iterating through the list of items to be sorted and swapping items that are out of order. After each iteration, if any swaps were made it iterates again. This results in the largest (or smallest) value “bubbling” up to the end of the list.

The algorithm is very simple, but both its average and worst case time complexities are O(N²) (quadratic), which make it fairly inefficient. Because of this, bubble sort is almost never used in real applications. However, one nice thing about bubble sort is that when applied to an already sorted list it only requires a single iteration to verify that the list is sorted. In this best case scenario the time complexity is O(1) (constant).

The version provided below includes an optimization that skips the already sorted elements at the end of the array during each successive iteration. This improves actual performance but does not change the overall time complexity.

/*
    Copyright 2019, Andrew C. Young <[email protected]>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
*/

// Bubble Sort

// Exported Methods
    .global bsort

// Code Section

bsort:
    // Bubble sort an array of 32bit integers in place
    // Arguments: R0 = Array location, R1 = Array size
    PUSH    {R0-R6,LR}          // Push registers on to the stack
bsort_next:                     // Check for a sorted array
    MOV     R2,#0               // R2 = Current Element Number
    MOV     R6,#0               // R6 = Number of swaps
bsort_loop:                     // Start loop
    ADD     R3,R2,#1            // R3 = Next Element Number
    CMP     R3,R1               // Check for the end of the array
    BGE     bsort_check         // When we reach the end, check for changes
    LDR     R4,[R0,R2,LSL #2]   // R4 = Current Element Value
    LDR     R5,[R0,R3,LSL #2]   // R5 = Next Element Value
    CMP     R4,R5               // Compare element values
    STRGT   R5,[R0,R2,LSL #2]   // If R4 > R5, store current value at next
    STRGT   R4,[R0,R3,LSL #2]   // If R4 > R5, Store next value at current
    ADDGT   R6,R6,#1            // If R4 > R5, Increment swap counter
    MOV     R2,R3               // Advance to the next element
    B       bsort_loop          // End loop
bsort_check:                    // Check for changes
    CMP     R6,#0               // Were there changes this iteration?
    SUBGT   R1,R1,#1            // Optimization: skip last value in next loop
    BGT     bsort_next          // If there were changes, do it again
bsort_done:                     // Return
    POP     {R0-R6,PC}          // Pop the registers off of the stack

Quicksort

Quicksort is often the second sorting algorithm one learns. It works by first choosing a pivot value and then dividing the list of items to be sorted into two buckets based on whether they are smaller than or larger than the pivot value. It then repeats this process on each bucket until the entire list is sorted.

The average time complexity of quicksort is O(n log n) (loglinear), making it much faster than bubble sort. However, if the pivot selected at each step is either the largest or smallest value in the list, then it effectively becomes a bubble sort and its time complexity becomes O(N²). This can be avoided by improving the pivot selection process so that it never selects either the largest or smallest value.

The simplest implementation simply chooses either the first value in the list or the last value to use as the pivot. With this implementation, applying the algorithm to an already sorted list results in the worst case scenario in terms of time complexity. The code below avoids this by using the “median-of-three” pivot selection algorithm. This works by looking at the first, middle, and last values in the list and selecting the one that is the median of these three values as the pivot. In this way the code is guaranteed to never choose either the largest or smallest value in the list.

/*
    Copyright 2019, Andrew C. Young <[email protected]>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
*/

// Quick Sort

// Exported Methods
    .global qsort

// Code Section
qsort:
    // Quick sort an array of 32bit integers
    // Arguments: R0 = Array location, R1 = Array size
    PUSH    {R0-R10,LR}         // Push registers on to the stack
    MOV     R4,R0               // R4 = Array Location
    MOV     R5,R1               // R5 - Array Size
    CMP     R5,#1               // Check for an array of size <= 1
    BLE     qsort_done          // If array size <= 1, return
    CMP     R5,#2               // Check for an array of size == 2
    BEQ     qsort_check         // If array size == 2, check values
qsort_partition:
    MOV     R1,#2               // Find the middle element
    SDIV    R2,R5,R1            // R2 = The middle element index
    LDR     R6,[R4]             // R6 = Beginning of array value
    LDR     R7,[R4,R2,LSL #2]   // R7 = Middle of array value
    SUB     R8,R5,#1            // R8 = Upper array bound index (len -1 1)
    LDR     R8,[R4,R8,LSL #2]   // R8 = End of the array value
    CMP     R6,R7               // Sort the values
    MOVGT   R9,R6
    MOVGT   R6,R7
    MOVGT   R7,R9
    CMP     R7,R8
    MOVGT   R9,R7
    MOVGT   R7,R8
    MOVGT   R8,R9
    CMP     R6,R7
    MOVGT   R9,R6
    MOVGT   R6,R7
    MOVGT   R7,R9
    MOV     R6,R7               // R6 = Pivot
    MOV     R7,#0               // R7 = Lower array bounds index
    SUB     R8,R5,#1            // R8 = Upper array bounds index (len - 1)
qsort_loop:
    LDR     R0,[R4,R7,LSL #2]   // R0 = Lower value
    LDR     R1,[R4,R8,LSL #2]   // R1 = Upper value
    CMP     R0,R6               // Compare lower value to pivot
    BEQ     qsort_loop_u        // If == pivot, do nothing
    ADDLT   R7,R7,#1            // If < pivot, increment lower index
    STRGT   R0,[R4,R8,LSL #2]   // If > pivot, swap values
    STRGT   R1,[R4,R7,LSL #2]
    SUBGT   R8,R8,#1            // if > pivot, decrement upper index
    CMP     R7,R8               // if indexes are the same, recurse
    BEQ     qsort_recurse
    LDR     R0,[R4,R7,LSL #2]   // R0 = Lower value
    LDR     R1,[R4,R8,LSL #2]   // R1 = Upper value
qsort_loop_u:
    CMP     R1,R6               // Compare upper value to pivot
    SUBGT   R8,R8,#1            // if > pivot, decrement upper index
    STRLT   R0,[R4,R8,LSL #2]   // If < pivot, swap values
    STRLT   R1,[R4,R7,LSL #2]
    ADDLT   R7,R7,#1            // if < pivot, increment lower index
    CMP     R7,R8               // if indexes are the same, recurse
    BEQ     qsort_recurse
    B       qsort_loop          // Continue loop
qsort_recurse:
    MOV     R0,R4               // R0 = Location of the first bucket
    MOV     R1,R7               // R1 = Length of the first bucket
    BL      qsort               // Sort first bucket
    ADD     R8,R8,#1            // R8 = 1 index past final index
    CMP     R8,R5               // Compare final index to original length
    BGE     qsort_done          // If equal, return
    ADD     R0,R4,R8,LSL #2     // R0 = Location of the second bucket
    SUB     R1,R5,R8            // R1 = Length of the second bucket
    BL      qsort               // Sort second bucket
    B       qsort_done          // return
qsort_check:
    LDR     R0,[R4]             // Load first value into R0
    LDR     R1,[R4,#4]          // Load second value into R1
    CMP     R0,R1               // Compare R0 and R1
    BLE     qsort_done          // If R1 <= R0, then we are done
    STR     R1,[R4]             // Otherwise, swap values
    STR     R0,[R4,#4]          //
qsort_done:
    POP     {R0-R10,PC}         // Pop registers off of the stack and return

Heapsort

Heapsort works by building a heap out of the array. Once an array has been converted to a heap, the first item is guaranteed to be either the smallest or largest value depending on whether the heap is a minimum or maximum heap, respectively. The heapsort algorithm uses a maximum heap and therefore the first value in the array will always be the largest value after the heap is created. It then swaps this value with the value at the end of the array and re-creates the heap ignoring the now properly sorted last element. The algorithm thus builds the sorted array in place from the end forward. When there are only two items left to be sorted it performs a simple comparison and swap.

The best, worst, and average time complexities of heapsort are all O(N log N). This is because building the heap has an average time complexity of O(log N) and this process is performed N times as the array is sorted.

/*
    Copyright 2019, Andrew C. Young <[email protected]>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
*/

// Heapsort

// Exported Methods
    .global hsort

// Code Section

hsort:
    // R0 = Array location
    // R1 = Array length
    PUSH    {R2-R12,LR}         // Push registers on to the stack
    CMP     R1, #1              // If the array has size 0 or 1, return
    BLE     hsort_done
    CMP     R1, #2              // If the array has size 2, swap if needed
    BLE     hsort_simple
    MOV     R11, #1             // R11 = 1 (constant)
    MOV     R12, #2             // R12 = 2 (constant)
    MOV     R10, #-1            // R10 = -1 (temporarily)
    UDIV    R10, R1, R12        // R10 = (len / 2) - 1 (start heapify index)
    SUB     R10, R10, #1
    SUB     R9, R1, #1          // R9 = len - 1 (start pop index)
    MOV     R2, R10             // R2 = Starting heapify index
heapify:
    // R2 = Index
    MOV     R3, R2              // R3 = Index of largest value
    MLA     R4, R2, R12, R11    // R4 = Left index (2i+1)
    MLA     R5, R2, R12, R12    // R5 = Right index (2i+2)
    LDR     R6, [R0, R2, LSL#2] // R6 = Index value (R0 + (R2 * 4))
    MOV     R7, R6              // R7 = Current largest value
heapify_check_left:
    CMP     R4, R9              // Check if left node is in array bounds
    BGT     heapify_check_right // If not then check the right node
    LDR     R8, [R0, R4, LSL#2] // R8 = Left value (R0 + (R4 * 4)
    CMP     R8, R7              // Compare left value to largest value
    BLE     heapify_check_right // If left <= largest check right
    MOV     R3, R4              // If left > largest, change largest index
    MOV     R7, R8              //    and copy the value
heapify_check_right:
    CMP     R5, R9              // Check if right node is in array bounds
    BGT     heapify_swap        // If not then continue
    LDR     R8, [R0, R5, LSL#2] // R8 = Right value (R0 + (R5 * 4))
    CMP     R8, R7              // Compare right value to largest value
    BLE     heapify_swap        // If right <= largest then continue
    MOV     R3, R5              // If right > largest, change largest index
    MOV     R7, R8              //    and copy the value
heapify_swap:
    CMP     R2, R3              // Check if largest is the initial value
    BEQ     heapify_next        // If so, exit the heapify loop
    STR     R7, [R0, R2, LSL#2] // If not, swap largest and initial values
    STR     R6, [R0, R3, LSL#2]
    MOV     R2, R3              //    and change index to largest index
    B       heapify             //    and recurse
heapify_next:
    CMP     R10, #0             // If last index was 0, heap is finished
    BEQ     heapify_pop
    SUB     R10, R10, #1        // Else, heapify next index
    MOV     R2, R10
    B       heapify
heapify_pop:
    LDR     R3, [R0]            // R3 = Largest value (front of heap)
    LDR     R4, [R0, R9, LSL#2] // R4 = Value from end of heap
    STR     R3, [R0, R9, LSL#2] // Store largest value at end of heap
    STR     R4, [R0]            // Store end value at start of heap
    SUB     R9, #1              // Decrement end of heap index
    CMP     R9, #1              // If there are only two items left, compare
    BEQ     hsort_simple
    MOV     R2, #0              // Else, heapify from the start of the heap
    B       heapify
hsort_simple:
    // Sort a list of exactly 2 elements
    LDR     R2, [R0]            // First value
    LDR     R3, [R0, #4]        // Second value
    CMP     R2, R3              // Compare values
    STRGT   R3, [R0]            // Swap if out of order
    STRGT   R2, [R0, #4]
hsort_done:
    POP     {R2-R12,PC}         // Return

Radix Sort

Radix sort works a bit differently from other sorting algorithms. Like quicksort it works by first dividing the unsorted list into two buckets and then iterating over each bucket. However, rather than using a comparison function to determine whether a given value is larger or smaller than another value it uses the radix (or base) to determine which bucket each value should go in.

For example, if we were going to sort a deck of 100 cards numbered 0-99, we might perform the sort like this:

Divide the cards into piles based on their first digit (0-9), left-to-right. This results in 10 piles.
For each pile, divide those cards into piles based on their second digit (0-9), left-to-right. This results in 10 new piles for each of the first piles, each with only a single card in it.
Reassemble the deck by picking up the cards from left to right and the cards will be in order.

As shown by the example, the number of buckets used at any given step depends on the radix (or base) being used. If the numbers are represented in base 10 (decimal), then each iteration produces 10 buckets. However, if the numbers are stored in base 2 (binary), then only two buckets are produced. Likewise, the data can be sorted either by the most significant digit (MSD) as in the example above, or by the least significant digit (LSD).

The code below implements an in-place binary MSD radix sort, also called a binary quicksort. Although sorting unsigned integers works as expected, sorting signed integers will result in the negative integers being sorted after the positive integers. This is due to the fact that negative signed integers have their most significant bit set to 1. This code could also probably be further optimized by moving the swap logic in-line, but I think it is more readable this way.

/*
    Copyright 2019, Andrew C. Young <[email protected]>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
*/

// Radix Most Significant Bit In-Place Sorting Algorithm

// Exported Methods
    .global rsort
    .global swap

// Code Section

rsort:
    // Radix MSD sort an array of 32bit integers
    // Arguments: R0 = Array location, R1 = Array size
    PUSH    {R0-R2,LR}          // Push existing registers on to the stack
    CMP     R1,#1               // Check for an empty or single member array
    POPLE   {R0-R7,PC}          // If so, return to where we came from
    ADD     R1,R0,R1,LSL #2     // R1 = End of the array (R0 + (R1*4))
    SUB     R1,R1,#4            //
    MOV     R2,#1               // R2 = Bitmask
    LSL     R2,R2,#31           //   most significant bit
    BL      rsort_recurse       // Begin recursion
    POP     {R0-R2,PC}          // Pop the registers off of the stack

rsort_recurse:
    // Radix MSD sort an array of 32bit integers (recursive helper)
    // Arguments: R0 = Array start, R1 = Array end, R2 = Bitmask
    PUSH    {R0-R8,LR}          // Store previous values
    SUB     R8,R1,R0            // Check array length
    CMP     R8,#4               //
    BLT     rr_done             // Return if array is empty or has 1 entry
    MOV     R3,R0               // R3 = 0s bin pointer (start of array)
    MOV     R4,R1               // R4 = 1s bin pointer (end of array)
    MOV     R5,#0               // R5 = Current value
    MOV     R6,R0               // Set original array start (debug)
    MOV     R7,R1               // Set original array end (debug)
rr_0loop:
    LDR     R5,[R3]             // Load next value from pointer
    TST     R5,R2               // Check bitmask
    BEQ     rr_0loop_next       // If the value is 0, loop
    BL      rr_swap             // If not, swap values
    B       rr_1loop_next       // Switch to the 1s bin
rr_0loop_next:
    ADD     R3,R3,#4            // Increment the pointer
    CMP     R3,R4               // Check if pointers are the same
    BGT     rr_next_bit         // If so, move to the next bit
    B       rr_0loop            // If not, check the next value
rr_1loop:
    ADD     R12,#1              // Increment loop counter (debug)
    LDR     R5,[R4]             // Load current value from pointer
    TST     R5,R2               // Check bitmask
    BNE     rr_1loop_next       // If the value is 1, loop
    BL      rr_swap             // If not, swap values
    B       rr_0loop_next       // Switch to the 0s bin
rr_1loop_next:
    SUB     R4,R4,#4            // Decrement the pointer
    CMP     R3,R4               // Check if pointers are the same
    BGT     rr_next_bit         // If so, move to the next bit
    B       rr_1loop            // If not, check the next value
rr_next_bit:
    LSR     R2,R2,#1            // Update bitmask to next bit
    CMP     R2,#0               // Check for all zeros
    BEQ     rr_done             // If so, do not recurse anymore
    MOV     R0,R6               // Array start
    MOV     R1,R3               // 0s bin array end
    SUB     R1,#4
    BL      rsort_recurse       // Sort 0s bin
    MOV     R0,R4               // 1s bin array start
    ADD     R0,#4
    MOV     R1,R7               // Array end
    BL      rsort_recurse       // Sort 1s bin
    B       rr_done             // All done
rr_swap:
    PUSH    {LR}                // Store LR
    MOV     R0,R3               // End of 0s bin
    MOV     R1,R4               // End of 1s bin
    BL      swap                // Swap values
    POP     {PC}                // Return
rr_done:
    POP     {R0-R8,PC}          // Restore previous values

swap:
    // Swaps the values from 2 memory locations
    // R0 = 1st memory location
    // R1 = 2nd memory location
    PUSH    {R0-R4,LR}          // Store previous values
    LDR     R3,[R0]             // Load value from the first memory location
    LDR     R4,[R1]             // Load value from the second memory location
    STR     R3,[R1]             // Store value 2 in memory location 1
    STR     R4,[R0]             // Store value 1 in memory location 2
    POP     {R0-R4,PC}          // Return

Additional Links and Videos

For more details on common sorting algorithms, including visualizations, I suggest checking out the Sorting Algorithm Animations page by Toptal. I also recommend the Sound of Sorting website, which is where I found many of the videos in this article. They have a YouTube playlist that includes visualizations of many different sorting algorithms. Finally, check out AlgoRythmics, who produced the dancing videos.