Equivalents of the Set Data Structure in Bash

1. Overview

Sets are a fundamental data structure in programming, offering a collection of unique elements without any predefined order. They’re incredibly versatile, allowing us to check for membership, eliminate duplicates, and make comparisons. While Bash doesn’t offer a built-in set-like structure, there are some equivalents we can use.

In this article, we’ll explore some effective workarounds for the set data structure in Bash. Initially, we’ll learn about associative arrays (Bash 4.0 and above). After that, we’ll discuss external tools like awk, sort, uniq, grep, and comm.

Finally, we’ll experiment with custom functions to tackle set-related tasks in our Bash scripts.

2. Sets in Bash

Unlike some programming languages, Bash doesn’t have a built-in set data structure. While this keeps Bash lightweight and fast, it presents a challenge for those seeking to manage collections of unique elements in their scripts.

2.1. Why There Are No Native Sets

Bash’s core design philosophy revolves around simplicity and efficiency. In essence, it forgoes some complex data structures, including sets, in favor of a more streamlined approach. This lean design makes Bash ideal for automating tasks as well as interacting with the operating system quickly.

2.2. Use Cases of Sets

While Bash offers arrays, they don’t perfectly replicate the functionality of sets. Sets provide some important features, such as:

membership testing – checking to see if an element exists within a collection.
deduplication – eliminating duplicate entries from a data set.
set operations – intersection, difference, union, and so on.

These are useful in scripts that capture unique inputs or perform data verification tasks.

3. Associative Arrays

Associative arrays, available in Bash 4.0 or later, are effective for emulating set functionality. Unlike traditional Bash arrays, which store elements in a sequential order, associative arrays function more like dictionaries or hash tables.

They store data in key-value pairs, where the keys serve as the unique elements of the set. The values associated with the keys can be ignored; they’re just placeholders.

This structure allows for swift retrieval and manipulation of elements based on their unique keys, mirroring the core concept of a set where membership is determined solely by the element itself.

3.1. Declaring a Set

To create an associative array, we use the declare -A syntax followed by the chosen array name:

$ declare -A unique_elements

Now unique_elements acts as a set, ready to hold a collection of unique elements.

3.2. Adding Elements

Though the associative array maps a key to a value, we can use it as a set by mapping our set’s entries to any value, for example true. We can only have one value associated with a specific key. So, attempting to add a duplicate key will overwrite the existing value, effectively ensuring a collection of distinct elements.

Let’s add an element to the set:

$ unique_elements["apple"]=true

Let’s add some more elements in a similar fashion:

$ unique_elements["banana"]=true
$ unique_elements["orange"]=true
$ unique_elements["apple"]=false

Adding another element with the same key, apple, overwrites the value previously stored for that key, as we’ll see when we print out the set.
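A set usually also needs a way to remove elements and report its size. The following is a minimal sketch, assuming Bash 4.0 or later; the fruits array and the placeholder value 1 are chosen only for illustration:

```shell
#!/usr/bin/env bash
# Sketch: basic set bookkeeping with an associative array (Bash 4.0+).
declare -A fruits

# Add elements; the value 1 is just a placeholder.
fruits["apple"]=1
fruits["banana"]=1
fruits["apple"]=1          # duplicate key: the set still holds a single "apple"

size_before=${#fruits[@]}  # number of distinct keys

# Remove an element by unsetting its key (quoted so the brackets aren't globbed).
unset 'fruits[banana]'

size_after=${#fruits[@]}

echo "before: $size_before, after: $size_after"
```

Here ${#fruits[@]} counts the keys, so it acts as the size of the set, and unset deletes a single key without touching the others.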

3.3. Printing the Set

To print the set, we use the exclamation mark ! within curly braces before the array name:

$ echo "${!unique_elements[@]}"
orange apple banana

We should note that in Bash we normally print the contents of an array using the @ symbol within square brackets [] after the array name:

$ echo "${unique_elements[@]}"
true false true

As we can see, though, this approach iterates through the array and prints only the values associated with the keys, which are of no interest when using the associative array as a set.
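Bash doesn’t guarantee any particular order when it expands the keys of an associative array. When a predictable order matters, one option is to print one key per line and sort the result. A small sketch, assuming the same unique_elements name as above (the sorted_keys variable is only illustrative):

```shell
#!/usr/bin/env bash
declare -A unique_elements=( ["apple"]=1 ["banana"]=1 ["orange"]=1 )

# Print one key per line, then sort for a predictable order.
sorted_keys=$(printf '%s\n' "${!unique_elements[@]}" | sort)

echo "$sorted_keys"
```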

3.4. Membership Testing

To determine if an element (key) exists within the set, we can use the same key within square brackets to check for its presence:

$ [[ ${unique_elements["apple"]} ]] && echo "apple is in the set" || echo "apple is not in the set"
apple is in the set

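This conditional expression expands the value stored under the key apple and evaluates to true if that value is a non-empty string. Since we always store a non-empty placeholder (even the string false counts), the test succeeds for every key in the set and fails for keys that were never added.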
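The test above looks at the value stored under the key, so it would fail for a key that maps to an empty string. A check that depends only on the key can use the ${parameter+word} expansion, which produces word whenever the key is set. The following is a sketch; set_contains is a hypothetical helper name, not a Bash builtin:

```shell
#!/usr/bin/env bash
declare -A unique_elements=( ["apple"]=false ["empty"]="" )

# set_contains KEY: succeed if KEY is present, whatever value it maps to.
# ${array[key]+x} expands to "x" only when the key is set, even if its value is empty.
set_contains() {
  [[ -n "${unique_elements[$1]+x}" ]]
}

set_contains "apple" && echo "apple is in the set"
set_contains "empty" && echo "empty is in the set"
set_contains "kiwi" || echo "kiwi is not in the set"
```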

4. Using External Tools

While associative arrays offer a convenient way to manage sets in Bash, sometimes we might encounter situations where using external tools can be beneficial.

4.1. Deduplication With awk

awk is a powerful text processing tool that can be used for various tasks, including set-like operations.

Let’s suppose we already have a file with some data:

$ cat groceries.txt
apple
banana
carrot
banana
grape
apple

We can filter through and remove duplicates with awk:

$ awk '!seen[$0]++' groceries.txt
apple
banana
carrot
grape

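awk evaluates the pattern !seen[$0]++ for every line. The first time a line appears, seen[$0] is 0, so the negation is true and awk prints the line; the post-increment then marks the line as seen, so later copies are skipped. Unlike sorting, this keeps the lines in their original order.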
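The same idea carries over to data that’s already in a Bash array. The sketch below mirrors awk’s seen array with an associative array, assuming Bash 4.0 or later; the items, seen, and unique names are illustrative:

```shell
#!/usr/bin/env bash
items=(apple banana carrot banana grape apple)

declare -A seen   # keys are the elements met so far
unique=()         # first occurrences, in input order

for item in "${items[@]}"; do
  # Keep the item only if its key hasn't been recorded yet.
  if [[ -z "${seen[$item]+x}" ]]; then
    seen["$item"]=1
    unique+=("$item")
  fi
done

echo "${unique[@]}"
```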

4.2. Deduplication With sort and uniq Commands

Similarly, we can use sort and then feed the output into uniq to filter through:

$ sort groceries.txt | uniq
apple
banana
carrot
grape

uniq is designed to identify and remove adjacent duplicate lines. This means it only compares a line with the one directly before it. If the lines aren’t sorted, uniq might miss duplicates that appear non-consecutively:

$ uniq groceries.txt
apple
banana
carrot
banana
grape
apple

We don’t want this, hence the use of sort to ensure all duplicate occurrences are next to each other.
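As a shorter alternative, sort can drop the duplicates itself through its -u option, which makes the separate uniq step unnecessary. A sketch using a temporary sample file (the tmp_file and deduped names are only for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical sample file, created here so the sketch is self-contained.
tmp_file=$(mktemp)
printf '%s\n' apple banana carrot banana grape apple > "$tmp_file"

# -u makes sort output only one copy of each line.
deduped=$(sort -u "$tmp_file")
echo "$deduped"

rm -f "$tmp_file"
```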

4.3. Membership Testing With grep

If we’re working with a text file as opposed to associative arrays, we can perform membership testing with grep:

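$ grep -qxF "apple" "groceries.txt" && echo True || echo False
True

grep operates on lines of text, making it a good option in this case. The -x option matches whole lines only, so apple doesn’t match a line such as pineapple, and -F treats the pattern as a literal string rather than a regular expression. With -q, grep prints nothing and stops at the first match. We should note, however, that in the worst case it still has to read the whole file, so this is not as efficient as looking up a key in an in-memory structure such as an associative array.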
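By default, grep matches substrings and treats the pattern as a regular expression, so a membership test is safest as a whole-line, fixed-string match. A small wrapper might look like the sketch below; file_contains is a hypothetical name, and the sample file exists only for the demonstration:

```shell
#!/usr/bin/env bash
# file_contains FILE ELEMENT: succeed if some line of FILE equals ELEMENT exactly.
#   -q  quiet, report through the exit status only
#   -x  match the whole line
#   -F  treat the pattern as a fixed string, not a regular expression
#   --  end of options, so an element starting with "-" is still a pattern
file_contains() {
  grep -qxF -- "$2" "$1"
}

tmp_file=$(mktemp)
printf '%s\n' pineapple banana > "$tmp_file"

file_contains "$tmp_file" "banana" && echo "banana: True"
file_contains "$tmp_file" "apple" || echo "apple: False"   # "pineapple" doesn't count

rm -f "$tmp_file"
```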

5. Set Operations

Beyond membership testing and deduplication, we can perform some operations on two or more sets including:

union – a combination of all elements of the sets
intersection – only the elements common to the sets
difference – the elements present in one set but not the other (taken in both directions, this is the symmetric difference)
subset – determining if all members of one set are also members of another set

5.1. Union of Two Sets

Let’s suppose we have a second group of elements:

$ cat new_groceries.txt
apple
grapefruit
kiwi
orange
pear

We can find the union of both sets:

$ sort groceries.txt new_groceries.txt | uniq
apple
banana
carrot
grape
grapefruit
kiwi
orange
pear

First, we sort the two files together, then we remove any duplicate elements.
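When the sets live in associative arrays instead of files, the union can be built by inserting the keys of both arrays into a third one. A minimal sketch, assuming Bash 4.0 or later; set_a, set_b, and union are illustrative names:

```shell
#!/usr/bin/env bash
declare -A set_a=( ["apple"]=1 ["banana"]=1 )
declare -A set_b=( ["apple"]=1 ["kiwi"]=1 )
declare -A union

# Insert every key from both sets; repeated keys collapse into one entry.
for key in "${!set_a[@]}" "${!set_b[@]}"; do
  union["$key"]=1
done

# Sort only to get a predictable print order.
printf '%s\n' "${!union[@]}" | sort
```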

5.2. Intersection of Two Sets

The comm command compares two sorted files line by line. It splits its output into three columns:

lines unique to the first file, not found in the second file
lines unique to the second file, not found in the first file
lines common to both files

This functionality enables us to mimic set behavior for tasks like finding the intersection or difference between two sets of data stored in separate files.

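Let’s run the comm command to find the intersection. Since groceries.txt is unsorted and contains duplicates, we pass both files through sort -u using process substitution:

$ comm -12 <(sort -u groceries.txt) <(sort -u new_groceries.txt)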
apple

Option -1 suppresses the column showing lines unique to the first file, while option -2 suppresses the column showing lines unique to the second file. Consequently, the output is the common lines (intersection) between the two files.
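For sets held in associative arrays, the intersection can be computed by keeping each key of one set that also exists in the other. The following is a sketch under the same assumptions as before (Bash 4.0 or later, illustrative array names):

```shell
#!/usr/bin/env bash
declare -A set_a=( ["apple"]=1 ["banana"]=1 ["grape"]=1 )
declare -A set_b=( ["apple"]=1 ["kiwi"]=1 ["grape"]=1 )
declare -A intersection

# Keep a key of set_a only if set_b has the same key.
for key in "${!set_a[@]}"; do
  if [[ -n "${set_b[$key]+x}" ]]; then
    intersection["$key"]=1
  fi
done

# Sort only to get a predictable print order.
printf '%s\n' "${!intersection[@]}" | sort
```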

5.3. Difference of Two Sets

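Alternatively, we can find the difference. Since comm expects sorted input, we sort and deduplicate both files first:

$ comm -3 <(sort -u groceries.txt) <(sort -u new_groceries.txt)
banana
carrot
grape
	grapefruit
	kiwi
	orange
	pear

The -3 option suppresses the third column, the lines common to both files. What remains are the lines unique to either file, that is, the symmetric difference: lines found only in groceries.txt appear in the first column, and lines found only in new_groceries.txt appear in the second column, indented by a tab.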
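comm -3 reports both directions at once. For a one-sided difference, we can combine -3 with -1 or -2 so that a single column remains. The sketch below recreates the two sample lists in a temporary directory; the work_dir, only_in_a, and only_in_b names are illustrative:

```shell
#!/usr/bin/env bash
work_dir=$(mktemp -d)
printf '%s\n' apple banana carrot banana grape apple > "$work_dir/a.txt"
printf '%s\n' apple grapefruit kiwi orange pear > "$work_dir/b.txt"

# -23 hides column 2 (only in the second input) and column 3 (common lines),
# leaving the lines found only in the first input.
only_in_a=$(comm -23 <(sort -u "$work_dir/a.txt") <(sort -u "$work_dir/b.txt"))

# -13 hides columns 1 and 3, leaving the lines found only in the second input.
only_in_b=$(comm -13 <(sort -u "$work_dir/a.txt") <(sort -u "$work_dir/b.txt"))

echo "only in a:"
echo "$only_in_a"
echo "only in b:"
echo "$only_in_b"

rm -rf "$work_dir"
```

Because only one column is left in each case, the output carries no tab indentation and can be fed directly into other commands.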

5.4. Custom Functions

While the tools we’ve just discussed provide a straightforward way to mimic set behavior, custom functions let us perform more complex operations.

For example, we can check if a file is a subset of another file. To do this, we need a function that accepts two arguments, the file we’re checking and the reference file:

$ cat is_subset.sh
#!/bin/bash

is_subset() {
  local elements_file="$1"
  local reference_file="$2"

  # implementation
}

is_subset "$1" "$2"

Next, we’ll create an associative array inside the is_subset() function with the content of the reference file:

declare -A ref_set
# Read reference set and populate the array
while read -r line; do
  ref_set["$line"]=1
done < "$reference_file"

Finally, we’ll read elements_file in a loop and check for any line that isn’t in reference_file. If there is one, we immediately know that elements_file is not a subset, and we return from the function with a non-zero status:

while read -r line; do
  if [[ ! "${ref_set[$line]}" ]]; then
    echo "Is not a subset"
    return 1
  fi
done < "$elements_file"

However, if the loop finishes without exiting, then elements_file is a subset:

echo "Is a subset"
return 0

Let’s give it a try:

$ bash is_subset.sh new_groceries.txt groceries.txt
Is not a subset

As expected, this is not a subset. Let’s do it again with a new file to get a positive result:

$ cat apple.txt
apple

$ bash is_subset.sh apple.txt groceries.txt
Is a subset

We can implement many other set operations in the same way by applying the appropriate logic within custom functions.
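As one more illustration, the subset check can also be written without an explicit loop: if comm finds no line that’s unique to the first file, that file is a subset of the second. This is a sketch rather than a drop-in replacement; is_subset_comm is a hypothetical name, and the function assumes that elements are non-empty lines:

```shell
#!/usr/bin/env bash
# is_subset_comm ELEMENTS_FILE REFERENCE_FILE (hypothetical name):
# succeeds when every line of ELEMENTS_FILE also appears in REFERENCE_FILE.
# Assumes the elements are non-empty lines.
is_subset_comm() {
  local missing
  # comm -23 keeps only the lines unique to the first (sorted) input.
  missing=$(comm -23 <(sort -u "$1") <(sort -u "$2"))
  [[ -z "$missing" ]]
}

work_dir=$(mktemp -d)
printf '%s\n' apple banana carrot grape > "$work_dir/reference.txt"
printf '%s\n' apple grape > "$work_dir/small.txt"
printf '%s\n' apple kiwi > "$work_dir/other.txt"

is_subset_comm "$work_dir/small.txt" "$work_dir/reference.txt" && echo "small: is a subset"
is_subset_comm "$work_dir/other.txt" "$work_dir/reference.txt" || echo "other: is not a subset"

rm -rf "$work_dir"
```

Compared with the loop-based version, this trades the early exit for brevity, since both files are always sorted in full.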

6. Conclusion

In this article, we discussed various equivalents for the set data structure in Bash. We learned that sets, collections of unique elements, can be implemented using associative arrays. We also tried out tools like awk, sort, uniq, grep, and comm, which are more suitable for data stored in files or for handling large datasets.

Finally, for more complex set manipulation, we created a custom function.

The best approach for working with sets in Bash depends on the specific task at hand. We should consider the size and complexity of the data, and the availability of Bash functionality and its surrounding tools.
