Data structures and algorithms are often known as the first real test of a computer science student, the one that weeds out the people who aren't really that into programming. It certainly is a complex topic, which is why we will only scratch the surface and try to provide a general overview to serve as a jumping-off point for further reading.
We will have a look at the basics of comparing algorithms and data structures, look at a few examples and go over some general paradigms for algorithm and data structure design. As mentioned, it's recommended to do further reading on these topics, since a comprehensive description would easily fill a computer science textbook by itself.
Definition of data structures and algorithms
An algorithm is a description of the steps needed to solve a specific problem. That means that all programs are basically algorithms. However, when we are talking about algorithms, we are usually referring to a solution to a common problem and most often to an algorithm someone else has already invented.
As an example, one of the most common problems in computer science has been sorting data. The ability to process larger amounts of data meant that efficiently putting that data in order became an early area of algorithm design, and as a result most sorting algorithms are quite old. People very rarely create custom sorting algorithms and usually implement an existing one instead.
Algorithms are of course not limited to sorting. There are algorithms for finding data efficiently, for finding paths through graphs, for mathematical computations and so on.
Data structures are composite structures of data, organized specifically to allow a problem to be solved efficiently. Most problems could be solved with nothing but lists, but such solutions might be too slow or difficult to understand. A properly designed data structure serves the algorithm by providing the right data quickly and accurately, while maintaining a clear interface for the actions that are carried out.
In a nutshell, data structures rely on assumptions that can be made about how the data is laid out in order to provide desired performance.
Complexity analysis
We already mentioned the topic of efficiency in relation to comparison of algorithms and data structures. In computing, efficiency comes in two forms: how much time it takes to solve the problem and how much data storage it takes to solve the problem. These two dimensions are called "time complexity" and "space complexity" respectively. An optimal solution to a specific problem minimizes both, but often you can trade a bit of space for reduced time and vice versa, so sometimes the two dimensions are in contention.
In order to compare different approaches to solving a problem, we have the tool of complexity analysis, which is a generally agreed upon mathematical framework for expressing time and space complexity of an algorithm. This framework doesn't directly compare seconds or bytes, since those are often affected by the runtime environment of the program. Instead, complexity analysis is based on the idea of scaling limits, where the most important question is how well the solution scales when input data grows.
The complexity is expressed with the Big O notation, which can be thought of as a mathematical function expressing the growth of time or space as input increases. It is possible to calculate and prove the Big O complexity value for an algorithm, but we will leave that as an exercise to willing readers.
For our purposes, we are concerned with just the typical classifications, which in order from best scaling to worst are:
- O(1) - constant time/space, the algorithm is not affected by larger input
- O(log n) - logarithmic time/space requirement that grows slower than input size
- O(n) - linear time/space, scales at same rate as input size
- O(n log n) - time/space requirement grows faster than input size, but typically scales to large inputs
- O(n^2) - quadratic time/space, fast growth relative to input size
- O(n!) - factorial time/space, computation of even small inputs may require significant resources
Other classifications exist, but these are the ones that you are most likely to run into. At the simplest level you can take these as a method of ranking, with one caveat: the Big O notation is only concerned with scaling, which means that there are algorithms and data structures with notionally better scaling that may still be slower for the majority of typical inputs than ones with a worse scaling factor. Big O and benchmarking are two different topics, and substituting one for the other only provides partial answers.
That should be all the computational complexity theory you need for now as we start looking at different data structures and algorithms.
Common data structures
Linear data structures
The most basic data structure is one that you are already familiar with, namely the linear table, also known as an array or a list.
Arrays have three big benefits going for them: they are simple, they group things up close together and accessing specific elements in them is fast. Adding and removing elements at the end of an array or indexing into the array are O(1) operations.
However, where arrays run into more trouble are insertions into the middle and searching for a specific element. In order to insert an element into the middle of an array, all of the elements after it need to be shifted over, which in our categorization is an O(n) operation. Searching for a specific element will similarly require looking at potentially all of the elements before you find what you are looking for. This actually also brings us to our first algorithm: the linear search:
function linearSearch(array, elem)
    for _, e in ipairs(array) do
        if e == elem then
            return e
        end
    end
    return nil
end

local result = linearSearch({1, 7, 9, 3, 5, 0, 2}, 5)
print(result) -- prints 5
However, we can make searching a fair bit faster with another fairly simple algorithm, provided the list is in order. In a sorted list we can rely on the fact that smaller elements will be on one side of the array and larger ones on the other side. So we guess that the element we are looking for is in the middle and compare the middle element to it; every time we do that, we cut the potential range in half. This gives us a search that completes in logarithmic, or O(log n), time, called binary search:
function binarySearch(array, elem)
    local left = 1
    local right = #array
    while left <= right do
        local i = math.floor((left + right) / 2)
        local e = array[i]
        if e == elem then
            return e
        elseif e < elem then
            left = i + 1
        else
            right = i - 1
        end
    end
    return nil
end

local array = {1, 7, 9, 3, 5, 0, 2}
table.sort(array)
local result = binarySearch(array, 5)
print(result) -- prints 5
That comes with the caveat that in order for the search to be O(log n) you have to have the list already pre-sorted. If you need to sort the list first, then you have to also account for the time complexity of sorting.
Typical sorting algorithms have a time complexity of O(n log n), so if we need to both sort the list and then search it, the combined time complexity is not O(log n) but O(n log n), because the larger complexity becomes dominant. When combining algorithms like this, you typically keep only the dominant term and discard the smaller ones.
Now, in addition to just storing data into a linear array, we can also turn the array into a couple of other data structures. One common one is a stack, which is a data structure that stores items in a last-in-first-out (LIFO) order. This can be useful in a variety of tasks. For example, I have used stacks in game development to maintain the list of various game states and transitions between them.
The stack provides two operations: push and pop. Push allows us to add a value to the stack and pop will take the last added item off the stack. Linear tables already provide us functions to make this pretty easy:
function push(stack, elem)
    table.insert(stack, elem)
end

function pop(stack)
    return table.remove(stack)
end

local stack = {}
push(stack, 1)
push(stack, 2)
push(stack, 3)
print(pop(stack)) -- prints 3
print(pop(stack)) -- prints 2
print(pop(stack)) -- prints 1
It is also possible to build a first-in-first-out (FIFO) list, or a queue, on top of an array. A naive version would simply insert at the back of the array and remove items at the front, which you certainly can do with just table.insert() and table.remove(), but removal from the beginning or middle of an array has an O(n) cost. There are a couple of ways that cost can be reduced. The simplest queue construct is one known as a ring buffer, or circular buffer, and it works as long as we know ahead of time how many items our queue will hold at most.
Instead of removing items from the front, we maintain two moving indices: one for writing and one for reading. If either one of these goes past the end of the array, we circle back around to the beginning. Enqueuing a new value moves the write index forward and dequeuing moves the read index forward. If the two indices point to the same location, the ring buffer is empty. Let's implement a ring buffer to demonstrate:
local ringBuffer = {
    writeIndex = 1,
    readIndex = 1,
    array = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0} -- capacity for 10 slots
}

function enqueue(buffer, elem)
    buffer.array[buffer.writeIndex] = elem
    -- wrap back to 1 after the last slot (Lua arrays are 1-indexed);
    -- note there is no overflow check: since write == read means empty,
    -- this buffer can hold at most one item less than its capacity
    buffer.writeIndex = buffer.writeIndex % #buffer.array + 1
end

function dequeue(buffer)
    if buffer.writeIndex == buffer.readIndex then
        return nil
    end
    local result = buffer.array[buffer.readIndex]
    buffer.readIndex = buffer.readIndex % #buffer.array + 1
    return result
end

enqueue(ringBuffer, "Hello")
enqueue(ringBuffer, "all")
enqueue(ringBuffer, "world")
print(dequeue(ringBuffer)) -- prints Hello
print(dequeue(ringBuffer)) -- prints all
print(dequeue(ringBuffer)) -- prints world
A ring buffer allows us to operate queues with O(1) time complexity for the enqueue() and dequeue() operations, which is a fair bit faster than paying the O(n) cost for each element we take out of the queue.
Hash tables
Hash tables, which in Lua are just the regular key-value tables, provide a very fast way to check for the existence of an element by its key and very fast retrieval of data. We've covered this in a tangent in a prior chapter, but essentially they work by utilizing a function that converts the key into a semi-unique numeric value - the hash - and then indexing the data by that value.
This means that in terms of complexity, storing and retrieving values is almost O(1). It's worth stressing that hash table operations aren't exactly O(1), because sometimes multiple values may map into the same hash and because the hash table may need to grow over time. Both of these cases require special handling and that handling takes a bit longer than O(1) time. However, with a reasonably empty hash table and a good hashing function, these special cases happen so rarely that averaged over time, the observable complexity is close to O(1). This is called an "amortized" O(1) complexity.
Hash tables are a quick and simple solution to a lot of problems if you just need a quick optimization to lookups. You can, for example, set up a hash table to act as a cache for heavy functions by storing the results of a calculation in the table and using the parameters of the calculation as the key. Over time the hash table then works like a fast version of the function. So, if you are solving a difficult problem and just need one simple way to speed it up, tactical use of a hash table is often a good place to start. They are not a universal answer to everything though: hash tables require a fair amount of extra memory, so in situations where memory is tight, they may not be suitable.
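As a sketch of that caching idea, here is a small memoizing wrapper. The names memoize and slowFib are made up for this example, and the wrapper only handles functions of a single argument:

```lua
-- A hypothetical expensive function: naive recursive Fibonacci.
local function slowFib(n)
    if n < 2 then return n end
    return slowFib(n - 1) + slowFib(n - 2)
end

-- Wrap a single-parameter function in a hash table cache.
local function memoize(fn)
    local cache = {}
    return function(arg)
        if cache[arg] == nil then
            cache[arg] = fn(arg) -- compute once, then serve from the table
        end
        return cache[arg]
    end
end

fastFib = memoize(slowFib)
print(fastFib(30)) -- computed the slow way the first time
print(fastFib(30)) -- answered straight from the cache
```

Real-world caches usually also limit their size somehow, since an unbounded cache is exactly the kind of extra memory cost mentioned above.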
Linked lists, trees and graphs
In addition to flat arrays, there is another way to represent a list of elements, one that is also very old and still often used, although its practical benefits have lessened over time. That is the linked list.
A linked list is a data structure consisting of nodes, where each node stores a value and at least a reference to the next node in the list. A linked list with just a link to the next node is called a "singly-linked" list, whereas if we also store a reference to the previous node, that would be called a "doubly-linked" list.
We can implement a basic singly-linked list in Lua like this:
function makeNode(element, nextNode)
return {value=element, nextNode=nextNode}
end
function printLL(list)
local currentNode = list
while currentNode do
print(currentNode.value)
currentNode = currentNode.nextNode
end
end
local myList = makeNode(1, makeNode(2, makeNode(3, nil)))
printLL(myList)1 2 3
Linked lists can be better than arrays on paper, since inserting an element anywhere into the list is an O(1) operation, provided you already hold a reference to the node at the insertion point: you only need to point the nextNode value of the previous node at the new one to append to the end of the list or splice into the middle. Adding an element to the start of the list is even easier: just create a new node pointing to the previous head. However, indexing into the middle of a linked list is not that easy, since you need to manually traverse the list node by node until you get to where you want to be, which has a complexity of O(n).
This is unfortunately where reality diverges heavily from theory, since in practice traversing linked lists is so much slower that the O(n) insertion time of arrays is often a better trade in terms of wall-time than the O(1) insertions of a linked list provide. This is due to modern systems utilizing caching, which means that reading different parts of the memory can be slower or faster, and unfortunately linked lists often result in sub-optimal memory access patterns. Modern computers love to access data sequentially, particularly forward, and nodes of a linked list can be laid out almost arbitrarily in memory. These days linked lists are mostly used in niche cases or when the inherent simplicity is worth the cost in runtime.
This doesn't mean that linked lists are useless though, since the linking behavior allows them to be turned into a few variant data types which are very useful.
Instead of storing just a reference to the next node, let's imagine a linked list that has two references: left and right. Like so:
local data = {value=1, left=nil, right=nil}

Now let's make some functions that will attach data to either of those references:
function insertLeft(node, newNode)
    node.left = newNode
end

function insertRight(node, newNode)
    node.right = newNode
end

Now each sub-node can also have its own left and right references, and they can have theirs, and so on. What we have created with this simple modification is a tree or, more specifically, a binary tree. Trees can be very useful, since they are very good at slicing data structures into smaller parts to avoid processing the rest of the data. For example, you can represent a key-value store using a binary tree by sorting subsequent nodes of the tree so that keys smaller than the parent node go on the left and keys larger than the parent node go on the right. If you maintain that property, then looking up a specific key will always skip approximately half of the remaining data: simply compare the key to the key in the current tree node and move to the left or the right side accordingly. This gets you an O(log n) lookup time for data.
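A minimal sketch of such a key-value tree might look like the following. The insert and lookup names are just illustrative, and note one caveat: without rebalancing, the O(log n) lookup only holds if insertions happen to keep the tree reasonably balanced.

```lua
-- A minimal binary search tree: smaller keys go left, larger keys go right.
local function insert(node, key, value)
    if node == nil then
        return {key=key, value=value, left=nil, right=nil}
    elseif key < node.key then
        node.left = insert(node.left, key, value)
    elseif key > node.key then
        node.right = insert(node.right, key, value)
    else
        node.value = value -- existing key: overwrite the value
    end
    return node
end

local function lookup(node, key)
    if node == nil then return nil end       -- key not in the tree
    if key == node.key then return node.value end
    if key < node.key then return lookup(node.left, key) end
    return lookup(node.right, key)
end

root = nil
root = insert(root, 5, "five")
root = insert(root, 2, "two")
root = insert(root, 8, "eight")
print(lookup(root, 8)) -- prints eight
```

Self-balancing variants such as AVL and red-black trees exist precisely to guarantee that balance, at the cost of more complicated insertions.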
Trees in general are also a good representation for various types of data. The HTML documents on web pages are trees of HTML elements, filesystem directories are trees of directories and files, and programming language interpreters and compilers often process programs in the form of abstract syntax trees. Trees are simply a very versatile way to represent data.
However, there is an even more versatile option. Trees allow you to represent branching data, but what if you want to have arbitrary references between any nodes? We have a tool for that too, and it's called a graph.
A graph is a structure of nodes, each of which may have an arbitrary number of references to any other node in the graph. It is a kind of super-type of trees and lists. You can use them to represent lists or trees, but where they shine the most is in representing arbitrarily complex relationships between data.
One example of a graph might be a dungeon in a video game, consisting of many rooms, each of which has some number of doors and teleporters leading to other rooms. And just like a human might get lost in such a dungeon, graphs can also be quite complicated for computers to navigate. Pathfinding algorithms are their own category of algorithms intended to solve exactly that problem. Especially if you are into game development, it's worth having a read through Dijkstra's algorithm and its variant A* search algorithm to find out how to traverse graphs efficiently.
Trees and graphs are solutions to many tricky algorithm puzzles, so it's worth being at least somewhat familiar with them. Some tree representations are also able to do away with some of the negative aspects of references by tightly packing the data into arrays, producing some really cool data types. One of my favorites is the heap, which represents data as a priority queue by forming it into a tightly packed binary tree on top of an array. Priority queues that can quickly find the highest or the lowest value of a data set are very common in algorithm puzzles in particular.
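To give a taste of that packing trick, here is a minimal min-heap sketch on top of a plain array. It omits things a production version would have, such as error handling when popping an empty heap:

```lua
-- A minimal min-heap on an array: with 1-based indexing, the parent of
-- index i lives at floor(i / 2) and its children at 2*i and 2*i + 1.
local function heapPush(heap, value)
    heap[#heap + 1] = value
    local i = #heap
    while i > 1 do
        local parent = math.floor(i / 2)
        if heap[parent] <= heap[i] then break end
        heap[i], heap[parent] = heap[parent], heap[i] -- bubble up
        i = parent
    end
end

local function heapPop(heap)
    local top = heap[1]
    heap[1] = heap[#heap]
    heap[#heap] = nil
    local i = 1
    while true do
        local smallest, left, right = i, 2 * i, 2 * i + 1
        if left <= #heap and heap[left] < heap[smallest] then smallest = left end
        if right <= #heap and heap[right] < heap[smallest] then smallest = right end
        if smallest == i then break end
        heap[i], heap[smallest] = heap[smallest], heap[i] -- sink down
        i = smallest
    end
    return top
end

heap = {}
heapPush(heap, 7)
heapPush(heap, 1)
heapPush(heap, 4)
print(heapPop(heap)) -- prints 1, the smallest value
```

Both operations only walk one root-to-leaf path, so pushes and pops cost O(log n), while the smallest element is always available at index 1 in O(1).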
Design paradigms for algorithms
Since there are so many algorithms for so many different uses, documenting them exhaustively here is impossible. But in addition to the algorithms already mentioned, I figured I would go over some popular paradigms that drive algorithm design. Many algorithms fall under at least one of these, although this set isn't fully exhaustive either. Hopefully they will be of some use in case you ever need to develop an algorithm of your own.
Brute-force
Sometimes the simplest approach is the best, and brute-force is about as simple as it gets in terms of algorithm design.
Brute-force algorithms are ones that go through the entire solution space by checking each possible solution one-by-one until a solution is found. Linear search of an array is one example of a brute-force algorithm.
Brute-force algorithms are usually not fast, but they are quick to implement. There are also problems for which no solution is known aside from brute-force. For smaller sets of data and smaller solution spaces, the brute-force solution might also be perfectly fine, as the amount of practical wall-time spent on a sub-optimal solution is negligible.
Brute-force algorithms do require you to be careful about correctly producing the set of possible solutions without skipping any or repeating the same solution multiple times. Spending a little time figuring out how to represent candidate solutions as sequences is a good way to avoid incomplete or redundant searches.
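As a small illustration, here is a brute-force search for a pair of numbers adding up to a target. The function name is made up for the example; the point is that iterating with j starting above i enumerates each pair exactly once, so no candidate is skipped or repeated:

```lua
-- Brute force over all pairs: O(n^2), but trivially correct.
local function findPairWithSum(array, target)
    for i = 1, #array do
        for j = i + 1, #array do -- j > i: each pair is checked exactly once
            if array[i] + array[j] == target then
                return array[i], array[j]
            end
        end
    end
    return nil -- no pair sums to the target
end

print(findPairWithSum({3, 9, 4, 6}, 10)) -- prints 4 and 6
```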
Greedy algorithms
Greedy algorithms are ones that assume the option that looks best is probably going to be the best. They essentially attempt to score each way to proceed and then pick the one with the best score. Often the score is acquired via some kind of a "heuristic", a function that tries to guide the algorithm towards the best options without necessarily guaranteeing the option truly is the best.
Because greedy algorithms are at least moderately short-sighted, their output is not necessarily always optimal, although in some cases it has been shown that a greedy algorithm produces the optimal result. Instead, their results are usually "good enough". Depending on the complexity of the scoring system, "short-sightedness" may also be a fairly relative concept. Chess bots analyze possible moves much further than humans, but even then they are not capable of checking all of the possibilities. Instead they check some moves ahead in the future and pick the move that shows most promise based on that analysis.
Greedy algorithms are also used in path-finding and both Dijkstra and A* are based on the idea that the path that appears shortest probably is the shortest.
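A classic small example of the idea is making change by always taking the largest coin that still fits. This happens to be optimal for "canonical" coin systems like the one below, but for some coin systems the greedy choice produces a sub-optimal result, which is exactly the trade-off described above:

```lua
-- Greedy coin change: repeatedly grab the biggest coin that fits.
local function makeChange(coins, amount)
    table.sort(coins, function(a, b) return a > b end) -- largest first
    local result = {}
    for _, coin in ipairs(coins) do
        while amount >= coin do
            result[#result + 1] = coin
            amount = amount - coin
        end
    end
    return result
end

change = makeChange({1, 5, 10, 20, 50}, 65)
print(table.concat(change, " + ")) -- prints 50 + 10 + 5
```

For a system like {1, 3, 4} and an amount of 6, the same greedy picks 4 + 1 + 1 even though 3 + 3 uses fewer coins, so it pays to check whether your problem actually rewards the greedy choice.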
Recursion and divide-and-conquer
Some problems can be difficult to reason about as a whole, but become much more manageable by splitting the task into smaller sub-tasks and solving each of those. Sometimes the sub-tasks also are just a smaller version of the larger tasks, meaning that the solution is recursive.
As an example, consider sorting a list. You can formulate the problem such that a sorted list is a list where smaller elements are on one side and larger elements on the other side and both sides are sorted.
In a pseudo-code representation, we could therefore create the following solution:
function sort(list)
    if #list <= 1 then return list end -- a trivial list is already sorted
    local pivot = list[1]
    local leftSide = findElementsSmallerThan(pivot, list)
    local rightSide = findElementsLargerThan(pivot, list)
    leftSide = sort(leftSide)
    rightSide = sort(rightSide)
    return concatenate(leftSide, pivot, rightSide)
end
That is, essentially, the implementation of QuickSort. As you can see, we solved the problem by calling the sort() function inside of itself. This is called recursion, and despite its notoriety, it is a mechanism that lends itself well to dividing and conquering problems by representing the solution as smaller and smaller sub-problems that we can tackle by using the same logic. It also works really well on problems related to trees, since a tree is basically just a bunch of smaller and smaller sub-trees, and a solution on a sub-tree can be expanded to solve a problem on the whole tree.
Dividing and conquering is in general a good technique for solving problems or doing big things in general. By slicing the task into smaller, more achievable pieces, you can start chipping away at it, whereas trying to tackle the whole thing at once would be daunting and possibly terrifying.
How do you eat an elephant? One bite at a time.
Monte Carlo
Monte Carlo method is not one that I have used many times, but it has been a joy when I have gotten to use it, because it makes it possible to represent extremely complex problems in a very simple way.
The Monte Carlo method is intended to solve problems using probabilities, which means that it doesn't guarantee the best result or even the correct result. However, it allows you to figure out probable answers or model the probability of an outcome.
The method itself is deceptively simple: you take an event with some probability and you execute that event - or, more likely, a simulation of the event - a large number of times. If you are dealing with events that depend on each other, you simulate the events based on their dependencies. Then you count the resulting outcomes, which gives you a probability distribution of the outcomes.
One idea I have played with is project estimation. Assuming you have a project consisting of tasks that depend on each other in some kind of a tree structure, how long will the project likely take as a whole?
Since each task may end up taking varying amounts of time, we provide an optimistic estimate and a pessimistic estimate for each task. Then we can essentially simulate different "timelines" of the project by randomly picking a time for each task between the optimistic and pessimistic estimates and adding the time spent on dependent tasks together. This will not give you an exact time for how long the project will take, but assuming the initial estimates were in the right ballpark, you should be able to give estimates at different levels of confidence. For example, you could say that at 90% confidence the project will take less than 6 months, or that with 50% confidence it will be completed in the next 3 months.
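Here is a toy sketch of that simulation. The task list, the estimates and the assumption that the tasks run strictly one after another are all made up for illustration:

```lua
-- Monte Carlo project estimation: simulate many "timelines" and count them.
math.randomseed(42)

-- Each task: optimistic and pessimistic estimates in days. For simplicity,
-- assume the tasks depend on each other sequentially, so durations add up.
local tasks = {
    {min = 2, max = 5},  -- design
    {min = 5, max = 15}, -- implementation
    {min = 3, max = 10}, -- testing
}

local runs = 10000
totals = {}
for run = 1, runs do
    local total = 0
    for _, task in ipairs(tasks) do
        -- pick a random duration between the two estimates
        total = total + math.random(task.min, task.max)
    end
    totals[run] = total
end

-- Sort the simulated timelines; the value 90% of the way through is the
-- duration we finish within in 90% of the simulated timelines.
table.sort(totals)
p90 = totals[math.floor(runs * 0.9)]
print("90% confidence: done in " .. p90 .. " days or less")
```

With more realistic inputs, the task graph would branch and parallel tasks would take the maximum of their durations rather than the sum, but the overall shape of the method stays the same: simulate, repeat, count.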
The Monte Carlo method, despite its simplicity, is very versatile and only really requires you to know how to simulate the events at your desired level of granularity, plus a bunch of repetitions to produce results. I am also pretty bad with probabilities, so computing things by repetition helps me reason about them much better than trying to remember formulas from probability courses.
Wrap up
This has been a really heavy chapter, so you can pat yourself on the back for making it through. I am not sure how well this ended up coming together and I had a few moments of doubt while writing it, but I hope you got something out of it.
As always, you don't fully learn something without practice, and the same applies to data structures and algorithms. To get a full grasp of the concepts, you really should do some further reading on the areas that interest you. I am hoping that with the info presented here you will be better able to tackle other texts on the topic.
The good thing is that practicing DSA is very simple, since most coding challenge platforms are basically just DSA problems. Leetcode, Codewars, and Advent of Code can be nice ways to put your skills to the test and write efficient solutions to problems. However, if that feels hard, you shouldn't despair. Disproportionate focus is put on difficult DSA problems in the industry, and in most cases you are not going to be dealing with strict time and space limitations, so even a slightly sub-optimal solution will often do. In fact, almost all of the time a clear and easy-to-understand solution beats a clever and inscrutable one when you are writing practical software.
Next time we will start delving into the concept of object-oriented programming and how to model your data as entities with associated methods for interacting with them. See you then!