I recently mentioned a personal project I’m working on to create a text editor, which I’m calling nabu. Tonight, I thought I’d take some time off from writing code to talk a bit about the data structure I’m using for the nabu character buffer: the gap buffer.
At its core, a character buffer is just a mutable list of characters. A list is an abstract data structure that holds an indexed set of elements. A mutable list can be modified, so you can replace any element with a new element, as well as add and remove elements at any index. A key feature of a list is random access, meaning you can access any element from the set by using its index.
There are lots of ways to implement a list data structure. The most straight forward is an array list, where a sequence of consecutive memory locations are used to store the elements, one element per location. An array list is really simple, and really memory efficient. It’s really fast to iterate over, great for random access, and modifying existing elements is extremely quick. However, inserting and deleting elements is generally quite slow. In the worst case for an insert, you need to reallocate the entire buffer (in order to get enough space for the new element), which involves copying every existing element to the new location, and then copy every element again to move them out of the way for the new element. Deletion is similar, though doesn’t usually require reallocation.
Another implementation option is a linked list, where the elements are stored in a set of nodes. Each node stores a single element as well as a pointer (reference) to the next node in the list. Linked lists are pretty efficient at element modification, insertion, and deletion. They’re relatively quick at iterating, but terrible at random access (in its simplest implementation, a linked list requires iterating all the way from the beginning to the element in question). They can also require a significant amount of memory overhead; in the case of a character buffer, a linked list can easily double the memory requirements, or worse (e.g., adding a 32-bit pointer for each 32-bit Unicode character).
Insertion and deletion are critically important for a text editor, so an array list would be a costly choice of implementation. A linked list would be better in this regard, but the memory overhead would be significant, as would the slowness of random access (imagine moving the cursor from the beginning of a document to the end by iterating one character at a time).
A third way to implement a list is called a gap buffer. Structurally, it’s very much like an array list but with a simple modification that makes insertion and deletion much cheaper. For this reason, gap buffers are commonly found in text editor implementations, including my own implementation in nabu.
So what is a gap buffer? Simply put, it’s an array list with a gap in it. Imagine you have an array list but you know you’re going to be appending some elements to it. You can pre-allocate additional space at the end of the array for the future elements. Conceptually, the buffer now has a gap at the end that can be filled in with added elements.
Well a gap buffer uses the same concept, except the gap can be anywhere in the buffer. Think of the gap as your caret, the location in the buffer where editing can occur (which you can move, with a bit of work, anywhere in the buffer to edit in different locations). To insert new elements (e.g., typing some text), write your data to the unused locations at the beginning of the gap. To delete characters, simply expand the gap into the characters to be deleted.
The gap is tracked with a pair of fields, typically the starting and ending location of the gap (alternatively, the starting location and length of the gap). For an empty buffer, the gap is the entire buffer (e.g., starting location is 0 and the ending location is the length of the buffer).
Remember, the gap is like your editing caret. To insert a character at the caret location, write it to the starting location of the buffer and increment the starting location of the buffer by one, so it now starts immediately past the inserted character. A subsequent insert will add the new character immediately after the previous one (and move the start of the gap, the caret, forward by one again), just as you would expect if you were typing.
To delete the character ahead of the caret (typical behavior of the “delete” key on PCs), simply increment the ending location (or the length) of the gap, so the gap expands into the deleted character. To delete a character behind the caret (“backspace” on PCs), simply decrement the starting position of the gap by one.
Even moving the caret forwards or backwards is fast and easy, as long as you’re willing to do it one character at a time. To move forward, you copy the next character after the gap to the starting location, and then increment both the starting and ending location of the gap. To move backwards, you copy the character immediately before the gap to the last position in the gap, and then decrement both indexes by one.
Moving the caret anywhere else in the buffer is possible as well, it just requires more copying. If you’re moving the caret forward, you need to copy all the characters from the current end of the gap to where the end of the gap will be after the move backwards to fill in the current gap and open up the new gap. Moving the caret backwards is basically the same except you need to copy from start of the new to start of the old and this time you’re moving them forward to fill in the old gap. The operation is O(d+g), where d is the distance you want to move by, and g is the size of the gap.
Of course you will need to handle edge cases, like if you’re trying to insert but your gap has size 0, or you’re trying to delete ahead when the gap is at the end of the buffer. But overall, a gap buffer is a simple and fairly efficient data structure for a character buffer, especially for text editing applications where modifications tend to be close together.