1. Introduction
In this tutorial, several common and useful data structures, namely vectors, arrays, linked lists, trees, graphs, and stacks are presented.
Our aim is to present a general, programming language-independent description of the data structures and to give an indication of their uses.
2. Arrays
An important data structure in computer science is the array, a consecutive set of fields.
Each field has a given size in memory. The fields themselves can be, for example, bytes, numbers, data structures of fixed size or even pointers to more complex data structures.
We assign each consecutive field with an index starting with 0 as the first. We can access the nth field in the array by multiplying n by the size of the field and adding that to the starting position.
Let’s see how we can find the 4th field in an array representing a one-dimensional array of numbers:
In a computing language, the fourth element can be given in a notation like V(4) or even V[4].
2.1. Matrices
A matrix is a special case of the array representing a 2-dimensional set of fields.
We can think of a matrix as an array of arrays:
Each field of the “first” array of the 2-dimensions holds a “second” array. We access the first array of fields by the first dimension. We use the second dimension to find the desired field within the “second” array:
Let’s see an example where we have a matrix, M, as an array of elements, each themselves with five elements. We can access the second element in the first vector, namely M(1,2), in the following way:
When we represent a matrix as an array in memory, we have the following:
We access the second number in the first vector by adding to the matrix start, the size of the vector plus two. In this case, our array field is itself an array of five numbers.
2.2. Uses of Arrays and Matrices
The array is a fundamental data structure in computer science. and its uses are boundless.
We can represent any set of data structures, ordered or unordered, as an array. Whenever we have a set or a list of the same objects, whether it be numbers, names, cities, jobs to do, etc., we can represent them as an array.
The array holds the list or set of things we want to store. For example, if we have a list of numbers, each field of the array would hold one of the numbers.
What we’re storing can also be complex. Suppose we’ve got a list of contact information for a set of employees. Each field of the array would be a data structure describing a given employee.
A spreadsheet is basically an array in matrix form. We access and manipulate the rows and columns of the spreadsheet by accessing the two-dimensional indices.
Any numerical calculation involving applications of linear algebra are represented as arrays of numbers. Applications range from vector and matrix multiplication, solving both linear and non-linear equations, statistics and even the data in machine learning.
3. Graphs
A graph is another useful data structure. It stems from graph theory.
A graph is a set of nodes – also called vertices – connected by edges – lines that connect the nodes.
We can assign properties to both the nodes and edges. In the language of graph theory, we say that a property is a color. If, for example, we assign the property “1” to a node, we say that the node’s color is “1”.
We can make the edges directed, meaning we can make one node point to another node or we can set the node of both to point to each other.
Below, we can see a graph with four nodes. Two nodes are colored with the labels “Jill” and “Jack”. We don’t have any colors for the other two nodes.
We created three types of edges. From “Jack”, we created a directed edge with an arrow and an undirected edge with no arrow. Between “Jack” and “Jill”, we created a node which points in both directions, and we labeled that node with the color, “related”:
Among other things, edges can indicate traversal patterns.
For example, if each node represents a website, a directed edge from site A to site B could indicate that there is a link on site A going to site B.
Using a graph of directed edges, we can think about whether or not one site is reachable from another and, if so, what’s the fewest number of clicks to get there.
3.1. Trees
A tree is a special type of graph. In graph theory, a tree is a graph that does not have cycles, or in other words, a connected acyclic graph.
When we traverse the tree through the edges, we only use an edge once and we don’t come to a node we’ve already been to.
Below, we present the same graph in two different ways. Regardless of how it’s drawn, it’s still a tree with no cycles.
The tree on the right shows a typical way to draw a tree, starting with the top node or root node with a path to the leaves. The leaves are the end of the path:
Another important concept in trees is the parent of a node. In the graph above, node 2 is the parent of node 5.
3.2. Representing a Graph Structure
There are several ways to represent graph data structures. Of the multitude of ways to represent a graph, which one we choose depends on how we’ll process and manage the graph data.
One representation centers on the nodes. We create a graph as a set of nodes. In each node, we put information about how we connect that node to the other nodes.
For example, we can outline the structure of node 1 in our earlier tree:
- Node Information
- Node Color: 1
- Edge Information
- Edge 0: Directed to 2
- Edge 1: Directed to 3
- Edge 2: Directed to 4
Another representation focuses on the edges. We see the graph as a collection of edges. Each edge has the color and type of the edge and the two nodes that it connects. The node information in the edge could be a pointer or an index showing where the node information is.
For example, we can outline the structure of our edge between node 1 and node two in our tree:
- Edge Information
- Edge Type: directed
- Edge Color: none
- Node Information
- Pointer/index to node 1 info
- Pointer/index to node 2 info
Another representation of the graph data structure is an adjacency matrix. If our graph has n nodes, its adjacency matrix will have n rows and columns.
For example, let’s take a look at a graph and its adjacency matrix:
Each row represents one of our nodes, and so does each column. If there’s a value in one of the cells, say in row 1 and column 4, then there’s an edge between node 1 and node 4. We can use different numbers in the cells to symbolize that edge’s weight.
With this setup, we can represent three types of relationships. Note that we’ll use A[i,j] to represent a cell in the matrix at row i and column j.
First, *if a specific cell is 0, say at A[i,j], then there’s no edge between nodes i and j*. The value of zero in A[1,3] means there’s no edge between node 1 and node 3.
Second, *if a specific cell is non-zero at A[i,j], then there’s an edge between nodes i and* j. We draw this as a directed edge. Since there’s a value of 5 in A[3,2], that means there’s a directed edge between node 3 and node 2.
Third, if both cells at A[i,j] and A[j,i] are non-zero, then there’s an edge between nodes i and j as before, but this time we’ll use an undirected edge. Since there’s a value of 1 in A[1,4] and in A[4,1], that means there’s an undirected edge between nodes 1 and 4.
Remembering that an edge’s property is its color, note that we simplify a bit and say that an edge with a weight of 1 has no color. So, the edge from node 3 to node 2 has a color of 5, but all the rest of our edges have no color.
3.3. Uses of Trees and Graphs
An important concept of the semantic web is an ontology, or vocabulary of linked data. In this type of example, the colored nodes of the graph are entities of information and the colored connections indicate their relationship. The resource description framework (RDF) formalizes how these relationships are represented.
For example, we can represent the RDF relationships between people as a graph:
In artificial intelligence search algorithms, a search is made through possible solutions to a problem.
Our current search position is the current “state” of the problem, which we can represent as a node.
How we traverse from this state to another state we can represent by a directed edge. The color of the edge is how we modified the current state to get to the next state.
Below, we can see an example from game theory showing the moves in a tic-tac-toe game.
We’re searching for the best set of moves to win the game. Each state is how the X’s and O’s are placed. We see that there are two winning paths for X:
We see, in this example, the importance of graphs in artificial intelligence. In fact, the basis of what can be said of the first programming language of artificial intelligence, namely LISP, is the graph. Though strictly speaking the fundamental structure in LISP is the linked list – we’ll see later that trees and graphs can be made with linked lists.
Graph trees can a useful representation of the parsing and manipulation of algebraic expressions. This idea stems from lambda calculus and the foundations of functional programming.
In the example below, we can see how we can represent a simple algebraic function as a graph:
In chemistry, particularly organic chemistry, a molecule can be represented as a graph. The colored nodes are the atoms with the atomic properties and the edges are the bonds between the atoms. In addition, reactions can be viewed as graphical transformations. Furthermore, whole systems of chemical reactions can be viewed as a chemical reaction network.
We can visualize a sucrose molecule as a graph in three dimensions:
4. Linked Lists
Another set of common useful data structures in programming are linked lists.
The basic element in a linked list is an element with two boxes, one with data and one with a pointer to the next element:
We have basically three different possibilities of connecting these elements:
- Only data: The data box has the appropriate data and the next box is null (points to nothing). This represents the “last” element.
- Data in a list: The data box has information and the next box to another element.
- Pointers: Both the data and the next boxes point to other elements. This represents branching.
4.1. Linked List Representation of Arrays and Trees
This data structure is very flexible.
For example, we can represent a consecutive set of numbers as a linked list:
The elements 0 through 4 each point to their next element and the last element, 5, points to nothing (has null as the pointer).
Or, if the data box is itself a pointer, any graph can be represented. The data box would be the node information and the edges are the next pointers to the other elements:
As mentioned before, the linked list is the fundamental structure of the first language for artificial intelligence, LISP, developed as early as 1960. It’s a remarkable achievement considering the predominant programming languages were numerical non-recursive languages like FORTRAN.
4.2. Uses of Linked Lists
The choice of which data structures to use can depend on different factors.
We can see that linked lists and arrays both can represent an ordered list. But what happens when the prevalent operation is to insert an element in this list? The computational complexity of this operation, meaning how long it’ll take, is quite different.
We’ll illustrate this point with a simple example of inserting a number into an ordered list using vector and linked list representations.
To insert a number into an array, we first have to allocate and make a copy of the list to add an extra field.
The complexity of this operation depends on the length of the list. As the list grows, so does the time to perform this task. In complexity notation, this is O(n).
Let’s imagine the process of inserting an extra number in an array of 12 numbers:
However, for linked lists, the basic operation is just to move one pointer to the new element. We set the new element’s pointer to the rest of the list.
This operation does not depend on the length of the list. No matter how big the list is, the complexity is the same. In complexity notation, this is O(1).
Now, let’s imagine the process of inserting an extra element in an array represented as a linked list:
Thus using a linked list could represent significant computational savings.
5. Stacks and Queues
Stacks and queues represent another useful set of data structures that has to do with how we put – or push – objects into a list and in what order we remove – or pop – objects from the set.
5.1. Stacks
A stack is just as it sounds, for example, a stack of books. Every time we push an object onto the stack, it goes on top.
As with a stack of books, when we pop an object off the stack it comes from the top. Another way to say this is “Last In and First Out” LIFO
Let’s see how we can push five objects onto the stack and that the “last” object we put in is the “first” object that we pop off:
5.2. Queues
A queue is also just as it sounds, for example standing in line.
Every time we enqueue an object, it’s added to the back of the queue. When we dequeue an object, we get it from the front. Again, another way to say this is “First In and First Out” (FIFO).
When we want to add a ball into a pipeline, we enqueue it. We see that the “first” ball we dequeue is the first ball we enqueued at the beginning of the process:
With some extra work, we can imagine how we might allow certain balls to skip part of the line in a priority queue.
5.3. Uses of Stacks and Queues
We can see that stacks and queues have similar functions. They determine, for example, the order in which a set of objects are to be processed.
A common use of a stack is the “undo” function. When we take an action, that action goes to the top of the stack. And when we want to “undo” this action, it comes from the top of the same stack:
Another common use of stack is in recursion. When we make a recursive call, we store the state of the current context at the top of the stack. When we return from a recursive call, we retrieve the top context from the stack and restore it. For example, we show in this figure how the recursive program to calculate factorial stores the previous call environments for each recursive call. We see that the result is returned and set into the calling statement:
We can use queues whenever we want to be fair about distributing a task. For example, when we want to send a printout to the printer. Or, when we want to send a computing job to a central processor in the cloud. When we have a job to do, we add it to the back of the queue. When we’re ready to do a job, we pick the next one up from the front.
6. Summary
In this article, we presented several common data structures that are useful in programming.
We represented some data structures, such as the graph, in several ways. And, we gave examples illustrating the inter-relationship between the structures.
We also saw that the choice of data structure, for example, an array or linked list, can affect the complexity of the process.