1. Overview
Sets are a fundamental data structure in programming, offering a collection of unique elements without any predefined order. They’re incredibly versatile, allowing us to check for membership, eliminate duplicates, and make comparisons. While Bash doesn’t offer a built-in set-like structure, there are some equivalents we can use.
In this article, we’ll explore some effective workarounds for the set data structure in Bash. First, we’ll learn about associative arrays (Bash 4.0 and above). After that, we’ll discuss external tools like awk, sort, uniq, grep, and comm.
Finally, we’ll experiment with custom functions to tackle set-related tasks in our Bash scripts.
2. Sets in Bash
Unlike some programming languages, Bash doesn’t have a built-in set data structure. While this keeps Bash lightweight and fast, it presents a challenge for those seeking to manage collections of unique elements in their scripts.
2.1. Why There Are No Native Sets
Bash’s core design philosophy revolves around simplicity and efficiency. In essence, it forgoes some complex data structures, including sets, in favor of a more streamlined approach. This lean design makes Bash ideal for automating tasks as well as interacting with the operating system quickly.
2.2. Use Cases of Sets
While Bash offers arrays, they don’t perfectly replicate the functionality of sets. Sets provide us with some important features, such as:
- membership testing – checking to see if an element exists within a collection.
- deduplication – eliminating duplicate entries from a data set.
- set operations – intersection, difference, union, and so on.
These features are useful in scripts that collect unique inputs or perform data verification tasks.
3. Associative Arrays
Associative arrays, available in Bash 4.0 and later, are effective for emulating set functionality. Unlike traditional Bash arrays, which store elements in sequential order, associative arrays function more like dictionaries or hash tables.
They store data in key-value pairs, where the keys serve as the unique elements of the set. The values associated with the keys can be ignored; they’re just placeholders.
This structure allows for swift retrieval and manipulation of elements based on their unique keys, mirroring the core concept of a set where membership is determined solely by the element itself.
3.1. Declaring a Set
To create an associative array, we use the declare -A syntax followed by the chosen array name:
$ declare -A unique_elements
Now unique_elements acts as a set, ready to hold a collection of unique elements.
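Since declare -A fails on older shells, a script may want to check the version first. The following is a minimal sketch, assuming the script simply aborts on Bash versions below 4.0; the error message wording is our own:

```bash
#!/usr/bin/env bash
# Associative arrays need Bash 4.0+; BASH_VERSINFO[0] holds the major version.
if (( BASH_VERSINFO[0] < 4 )); then
    echo "Associative arrays require Bash 4.0 or later" >&2
    exit 1
fi

declare -A unique_elements        # an empty set
echo "${#unique_elements[@]}"     # number of elements currently in the set
```

The ${#unique_elements[@]} expansion gives the number of keys, which is the size of the set.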
3.2. Adding Elements
Though the associative array maps a key to a value, we can use it as a set by mapping our set’s entries to any value, for example true. We can only have one value associated with a specific key. So, attempting to add a duplicate key will overwrite the existing value, effectively ensuring a collection of distinct elements.
Let’s add some elements to the set:
$ unique_elements["apple"]=true
Let’s add some more elements in a similar fashion:
$ unique_elements["banana"]=true
$ unique_elements["orange"]=true
$ unique_elements["apple"]=false
Assigning to the existing key apple again overwrites the value stored for that key rather than adding a second entry, as we’ll see when we print out the set.
3.3. Printing the Set
To print the elements of the set, which are the keys of the array, we place an exclamation mark ! before the array name inside the curly braces:
$ echo "${!unique_elements[@]}"
orange apple banana
The keys come back in no guaranteed order, which fits the notion of a set. We should also note that in Bash we normally print collections using just the @ symbol within square brackets [] after the array name:
$ echo "${unique_elements[@]}"
true false true
As we can see, though, this expansion prints only the values associated with the keys, which are of no interest when we use the associative array as a set.
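Adding and printing can be combined to deduplicate a list in memory. The sketch below uses a hypothetical array name, fruit_set, and a made-up input list. It also shows two operations not covered above: counting elements with ${#fruit_set[@]} and removing one with unset:

```bash
#!/usr/bin/env bash
declare -A fruit_set

# Hypothetical input list containing duplicates.
for item in apple banana orange apple banana; do
    fruit_set["$item"]=1
done

echo "size: ${#fruit_set[@]}"     # counts distinct keys only

unset 'fruit_set[banana]'         # remove one element from the set
echo "size: ${#fruit_set[@]}"

# Key order is unspecified, so we sort when a stable listing matters.
printf '%s\n' "${!fruit_set[@]}" | sort
```

Quoting the argument to unset keeps the shell from treating the square brackets as a glob pattern.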
3.4. Membership Testing
To determine if an element (key) exists within the set, we can use the same key within square brackets to check for its presence:
$ [[ ${unique_elements["apple"]} ]] && echo "apple is in the set" || echo "apple is not in the set"
apple is in the set
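This conditional expression expands the value stored under the key apple and evaluates to true if that value is a non-empty string. Since we store a non-empty placeholder for every element and a missing key expands to an empty string, the test tells us whether apple is in the set. Notably, the string false is non-empty too, so apple still counts as present.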
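Strictly speaking, a test on the expanded value would report a key that holds an empty string as missing. A minimal sketch of a check that looks at the key itself, using the ${parameter+word} expansion, could look like this. The set_contains helper is a hypothetical name, not a Bash builtin:

```bash
#!/usr/bin/env bash
declare -A unique_elements=( [apple]=true [banana]="" )

# set_contains is a hypothetical helper, not part of Bash.
# ${array[key]+x} expands to "x" whenever the key is set, even if its value is empty.
set_contains() {
    [[ -n "${unique_elements[$1]+x}" ]]
}

set_contains apple  && echo "apple is in the set"
set_contains banana && echo "banana is in the set"
set_contains cherry || echo "cherry is not in the set"
```

Here, banana is reported as present even though its value is empty, because the expansion depends only on whether the key exists.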
4. Using External Tools
While associative arrays offer a convenient way to manage sets in Bash, sometimes we might encounter situations where using external tools can be beneficial.
4.1. Deduplication With awk
awk is a powerful text processing tool that can be used for various tasks, including set-like operations.
Let’s suppose we already have a file with some data:
$ cat groceries.txt
apple
banana
carrot
banana
grape
apple
We can filter through and remove duplicates with awk:
$ awk '!seen[$0]++' groceries.txt
apple
banana
carrot
grape
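This command processes the file line by line. The expression seen[$0]++ returns the number of times awk has already encountered the current line and then increments it, so !seen[$0]++ is true only for the first occurrence, which awk prints. The seen array lives only for that single execution, and the output keeps the original line order.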
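The same filter also works on data that’s already in a Bash array. As a sketch, assuming hypothetical array names items and unique_items, we can feed the elements to awk and read the result back with mapfile:

```bash
#!/usr/bin/env bash
# Sample data standing in for the contents of groceries.txt.
items=(apple banana carrot banana grape apple)

# mapfile (Bash 4.0+) reads awk's deduplicated output into an array, one line per element.
mapfile -t unique_items < <(printf '%s\n' "${items[@]}" | awk '!seen[$0]++')

printf '%s\n' "${unique_items[@]}"
```

Unlike the associative array approach, this keeps the elements in their order of first appearance.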
4.2. Deduplication With sort and uniq Commands
Similarly, we can sort the file and pipe the output into uniq to drop the duplicates:
$ sort groceries.txt | uniq
apple
banana
carrot
grape
uniq is designed to identify and remove adjacent duplicate lines. This means it only compares a line with the one directly before it. If the lines aren’t sorted, uniq might miss duplicates that appear non-consecutively:
$ uniq groceries.txt
apple
banana
carrot
banana
grape
apple
We don’t want this, hence the use of sort to ensure all duplicate occurrences are next to each other.
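Two related variants may be handy here. In the sketch below, sort -u sorts and deduplicates in a single step, while uniq -d lists the entries that occur more than once. The temporary directory and variable names are only there to make the example self-contained:

```bash
#!/usr/bin/env bash
# Recreate the sample file in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana carrot banana grape apple > "$tmpdir/groceries.txt"

# sort -u sorts and removes duplicates in one step.
deduped=$(sort -u "$tmpdir/groceries.txt")

# uniq -d prints one copy of each line that occurs more than once.
repeated=$(sort "$tmpdir/groceries.txt" | uniq -d)

echo "$deduped"
echo "---"
echo "$repeated"

rm -r "$tmpdir"
```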
4.3. Membership Testing With grep
If we’re working with a text file as opposed to associative arrays, we can perform membership testing with grep:
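$ grep -qxF "apple" "groceries.txt" && echo True || echo False
True
grep operates on lines of text, making it a good option in this case. The -x option makes grep match whole lines only, and -F treats the pattern as a fixed string rather than a regular expression. Without them, a search for grape would also match grapefruit. We should note, however, that grep reads the file from the start on every lookup and can stop early only once it finds a match, so this is not as efficient as using a set-like structure in memory.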
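For repeated lookups, the test can be wrapped in a small function. This is only a sketch: file_contains is a hypothetical helper name, and the temporary file stands in for groceries.txt:

```bash
#!/usr/bin/env bash
# file_contains is a hypothetical helper: it succeeds when FILE has a line equal to ELEMENT.
# -F treats the element as a fixed string, -x requires a whole-line match,
# -q suppresses output, and -- guards against elements that start with a dash.
file_contains() {
    local element="$1" file="$2"
    grep -qxF -- "$element" "$file"
}

tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT      # clean up the sample file when the script exits
printf '%s\n' apple banana carrot grape > "$tmpfile"

file_contains grape "$tmpfile" && echo "grape: True"
file_contains grap  "$tmpfile" || echo "grap: False"
```

Because of the whole-line match, the partial string grap isn’t reported as a member.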
5. Set Operations
Beyond membership testing and deduplication, we can perform some operations on two or more sets including:
- union – a combination of all elements of the sets
- intersection – only the elements common to the sets
- difference – the elements present in one set but not in the other; taken in both directions, this gives the symmetric difference
- subset – determining if all members of one set are also members of another set
5.1. Union of Two Sets
Let’s suppose we have a second group of elements:
$ cat new_groceries.txt
apple
grapefruit
kiwi
orange
pear
We can find the union of both sets:
$ sort groceries.txt new_groceries.txt | uniq
apple
banana
carrot
grape
grapefruit
kiwi
orange
pear
First, we sort the two files together, then we remove any duplicate elements.
5.2. Intersection of Two Sets
The comm command excels at comparing two sorted files line by line. It categorizes lines into three groups:
- common lines present in both files
- unique to the first file, not found in the second file
- unique to the second file, not found in the first file
This functionality enables us to mimic set behavior for tasks like finding the intersection or difference between two sets of data stored in separate files.
Since comm expects sorted input and groceries.txt is neither sorted nor free of duplicates, we pass both files through sort -u using process substitution and then ask comm for the intersection:
$ comm -12 <(sort -u groceries.txt) <(sort -u new_groceries.txt)
apple
Option -1 suppresses the column showing lines unique to the first file, while option -2 suppresses the column showing lines unique to the second file. Consequently, the output is the common lines (intersection) between the two files.
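When both sets already live in associative arrays rather than files, the intersection can be computed with a loop instead. The following is a sketch under that assumption, with hypothetical array names set_a, set_b, and common:

```bash
#!/usr/bin/env bash
declare -A set_a=( [apple]=1 [banana]=1 [carrot]=1 [grape]=1 )
declare -A set_b=( [apple]=1 [grapefruit]=1 [kiwi]=1 [orange]=1 [pear]=1 )
declare -A common

for key in "${!set_a[@]}"; do
    # Keep the key only if set_b has it as well.
    if [[ -n "${set_b[$key]+x}" ]]; then
        common["$key"]=1
    fi
done

printf '%s\n' "${!common[@]}"
```

Storing into a different target array in the loop body gives other operations; for instance, copying the keys of both arrays into one array yields the union.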
5.3. Difference of Two Sets
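Alternatively, we can find the elements that appear in only one of the two sets. As comm expects sorted input, we sort and deduplicate both files first:
$ comm -3 <(sort -u groceries.txt) <(sort -u new_groceries.txt)
banana
carrot
grape
	grapefruit
	kiwi
	orange
	pear
The -3 option suppresses the third column, which holds the lines common to both files. What remains is the symmetric difference: lines unique to the first file in the first column, and lines unique to the second file in the second column, which comm indents with a tab.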
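To get a one-sided difference, we can suppress two columns at once. The sketch below recreates the two sample files in a temporary directory so that it runs on its own; the variable names only_old and only_new are illustrative:

```bash
#!/usr/bin/env bash
tmpdir=$(mktemp -d)
trap 'rm -r "$tmpdir"' EXIT
printf '%s\n' apple banana carrot banana grape apple > "$tmpdir/groceries.txt"
printf '%s\n' apple grapefruit kiwi orange pear > "$tmpdir/new_groceries.txt"

# comm expects sorted input, so both files go through sort -u first.
# -23 keeps only column 1: lines found in the first file alone.
only_old=$(comm -23 <(sort -u "$tmpdir/groceries.txt") <(sort -u "$tmpdir/new_groceries.txt"))

# -13 keeps only column 2: lines found in the second file alone.
only_new=$(comm -13 <(sort -u "$tmpdir/groceries.txt") <(sort -u "$tmpdir/new_groceries.txt"))

echo "$only_old"
echo "---"
echo "$only_new"
```

With only one column left, comm prints it without any leading tab, so the output can be used directly as a list of elements.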
5.4. Custom Functions
While the tools we’ve just discussed provide a straightforward approach for mimicking set behavior, we can perform even more complex operations with custom functions.
For example, we can check if a file is a subset of another file. To do this, we need a function that accepts two arguments, the file we’re checking and the reference file:
$ cat is_subset.sh
#!/bin/bash
is_subset() {
    local elements_file="$1"
    local reference_file="$2"
    # implementation
}
is_subset "$1" "$2"
Next, we’ll create an associative array inside the is_subset() function with the content of the reference file:
declare -A ref_set
# Read reference set and populate the array
while read -r line; do
    ref_set["$line"]=1
done < "$reference_file"
Finally, we’ll read elements_file in a second loop and check for any line that isn’t in reference_file. If we find one, we immediately know that elements_file is not a subset, so we report it and return from the function with a non-zero status:
while read -r line; do
    if [[ ! "${ref_set[$line]}" ]]; then
        echo "Is not a subset"
        return 1
    fi
done < "$elements_file"
However, if the loop finishes without exiting, then elements_file is a subset:
echo "Is a subset"
return 0
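Assembled, the fragments might look like the following sketch. The IFS= prefix, the guard against empty lines, and the argument count check are our own additions, meant to avoid trimmed whitespace, empty array subscripts, and calls without file names:

```bash
#!/usr/bin/env bash
# Assembled sketch of is_subset.sh. The IFS= prefix, the empty-line guard, and
# the argument check at the bottom are additions to the fragments shown above.
is_subset() {
    local elements_file="$1"
    local reference_file="$2"
    local line
    declare -A ref_set            # declare inside a function creates a local array

    # Load every non-empty line of the reference file as a key.
    while IFS= read -r line; do
        [[ -n "$line" ]] && ref_set["$line"]=1
    done < "$reference_file"

    # Fail on the first element that is missing from the reference set.
    while IFS= read -r line; do
        [[ -z "$line" ]] && continue
        if [[ ! "${ref_set[$line]}" ]]; then
            echo "Is not a subset"
            return 1
        fi
    done < "$elements_file"

    echo "Is a subset"
    return 0
}

# Run the check only when both file arguments are supplied.
if (( $# == 2 )); then
    is_subset "$1" "$2"
fi
```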
Let’s give it a try:
$ bash is_subset.sh new_groceries.txt groceries.txt
Is not a subset
As expected, this is not a subset. Let’s do it again with a new file to get a positive result:
$ cat apple.txt
apple
$ bash is_subset.sh apple.txt groceries.txt
Is a subset
We can perform multiple other set operations by applying logic within custom functions.
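For comparison, the subset check can also be expressed with comm instead of an associative array. This is a sketch of an alternative, not the script above: is_subset_comm is a hypothetical name, and the sample files are recreated in a temporary directory:

```bash
#!/usr/bin/env bash
# is_subset_comm is a hypothetical name for this variant.
# comm -23 prints the lines found only in the first input; no output means
# every element of the first file also appears in the second.
is_subset_comm() {
    local only_in_first
    only_in_first=$(comm -23 <(sort -u "$1") <(sort -u "$2"))
    [[ -z "$only_in_first" ]]
}

tmpdir=$(mktemp -d)
trap 'rm -r "$tmpdir"' EXIT
printf '%s\n' apple banana carrot banana grape apple > "$tmpdir/groceries.txt"
printf '%s\n' apple grapefruit kiwi orange pear > "$tmpdir/new_groceries.txt"
printf '%s\n' apple > "$tmpdir/apple.txt"

is_subset_comm "$tmpdir/apple.txt" "$tmpdir/groceries.txt" && echo "Is a subset"
is_subset_comm "$tmpdir/new_groceries.txt" "$tmpdir/groceries.txt" || echo "Is not a subset"
```

This version sorts both files on every call, so the associative array approach may be preferable when the same reference set is checked many times.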
6. Conclusion
In this article, we discussed various equivalents for the set data structure in Bash. We learned that sets, collections of unique elements, can be implemented using associative arrays. We also tried out tools like awk, sort, uniq, grep, and comm, which are more suitable for advanced tasks or handling large datasets.
Finally, for more complex set manipulation, we created a custom function.
The best approach for working with sets in Bash depends on the specific task at hand. We should consider the size and complexity of the data, and the availability of Bash functionality and its surrounding tools.