1. Overview

When manipulating text, especially in large files, removing the last occurrence of a pattern can be challenging because tools such as sed and awk read input from top to bottom and can’t tell in advance which match is the final one.

In this tutorial, we’ll learn how to remove the last occurrence of a pattern in a file using command-line utilities such as sed, awk, and tac, as well as the Vim editor.

2. Scenario Setup

Let’s take a look at the items.txt sample text file:

$ cat items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9,

The file contains comma-delimited values.

Since the values form a single comma-delimited list that spans three lines, every line ends with a comma. However, the comma (,) after the last value, item9, is extraneous. So, we aim to remove this last comma from the items.txt file.
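
To follow along, we can recreate the sample file with printf. The snippet below is a minimal sketch; the scratch directory created with mktemp is our own choice and isn’t part of the original setup:

```shell
# Work in a scratch directory so that no existing file gets overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the three comma-terminated lines to items.txt
printf '%s\n' 'item1,item2,item3,' 'item4,item5,item6,' 'item7,item8,item9,' > items.txt

# Count the commas in the file: three per line
comma_count=$(tr -cd ',' < items.txt | wc -c)
echo "commas: $comma_count"
```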

3. Using sed

Let’s explore how to solve this use case via the sed command-line utility.

3.1. Using Greedy Approach

We can start with one of the common sed idioms to read the entire file into the pattern space:

$ sed -E ':a;N;$!ba' items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9,

Now, let’s break this down to understand the nitty-gritty of the logic. To begin with, we define the a label to facilitate iteration. Then, the N command appends the next line to the pattern space, and $!ba branches (b) back to the a label unless we’ve reached the last line ($). As a result, the entire file is displayed because sed prints the pattern space by default.

If we look closely, there are just two commands other than the label definition:

:a
N
$!ba
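
To check that the loop really gathers every line into the pattern space, we can add the l command, which prints the pattern space unambiguously and shows embedded newlines as \n. This is a small sketch on a hypothetical two-line input, assuming GNU sed:

```shell
# -n suppresses the automatic print; l shows the pattern space with
# embedded newlines as \n and marks its end with $
pattern_space=$(printf 'a,\nb,\n' | sed -n ':a;N;$!ba;l')
printf '%s\n' "$pattern_space"
```

The single output line, a,\nb,$, indicates that both input lines were joined before any further command ran.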

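Next, we can use a greedy match with the (.*),(.*) pattern in the substitution (s) command. Because the first .* is greedy, it consumes as much text as possible, so the literal comma between the two groups matches the last comma in the pattern space. The text before and after that comma is captured as \1 and \2, and the replacement joins them without the comma: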

$ sed -E ':a;N;$!ba; s/(.*),(.*)/\1\2/' items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9

Thus, we get the correct results.
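
One caveat is worth noting: with GNU sed, the N command prints the pattern space and exits when there’s no next line, so for a single-line file, the substitution never runs. A hedged variant guards N with $! so that the s command is also applied to single-line input. The remove_last_comma function name and the sample inputs are our own, and the sketch assumes GNU sed:

```shell
# Hypothetical helper: remove the last comma after reading all lines
# into the pattern space. $!{N;ba;} appends lines only while more input
# remains, so the s command still runs when the input has a single line.
remove_last_comma() {
    sed -E ':a;$!{N;ba;};s/(.*),(.*)/\1\2/' "$@"
}

# The last comma sits in the middle of the second line
multi=$(printf 'a,b\nc,d\n' | remove_last_comma)

# Single-line input with a trailing comma
single=$(printf 'x,y,\n' | remove_last_comma)

printf '%s\n' "$multi" "$single"
```

Here, the multi-line input becomes a,b followed by cd, and the single-line input becomes x,y.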

3.2. Using tac

The greedy approach to read the entire file into the pattern space works fine for smaller datasets. However, we may notice performance issues with large datasets because of extensive memory utilization.

To optimize memory usage, we can employ the tac command to reverse the order of lines, remove the pattern, and then reverse the order of lines back again:

$ tac <file> | <sed script to remove> | tac

Since tac shows the file’s contents with the last line first and the first line last, we use sed to remove the last occurrence of the pattern in the first line containing it.
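
To get a feel for tac on its own, here’s a minimal sketch with a hypothetical three-line input. It assumes the tac utility from GNU coreutils is available:

```shell
# tac prints its input with the line order reversed
reversed=$(printf 'first\nsecond\nthird\n' | tac)
printf '%s\n' "$reversed"

# Applying tac twice restores the original order
restored=$(printf 'first\nsecond\nthird\n' | tac | tac)
printf '%s\n' "$restored"
```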

Let’s see the entire series of commands in action:

$ tac items.txt \
| sed -n -E ':remove_and_print;s/(.*),(.*)/\1\2/;t print_only; p; n; b remove_and_print; :print_only; p; n; b print_only;' \
| tac
item1,item2,item3,
item4,item5,item6,
item7,item8,item9

Again, the approach works as expected. Let’s break down the sed script, which uses extended regular expressions (-E) and suppresses automatic printing (-n).

We define two labels for flow control, namely, remove_and_print and print_only. Within the remove_and_print block, we try to remove the last occurrence of the pattern on the current line. After a successful substitution, the t command transfers the flow to print_only. Otherwise, we print (p) the line, read the next (n) one, and try again:

:remove_and_print
s/(.*),(.*)/\1\2/;
t print_only;
p;
n;
b remove_and_print;

Moreover, within the print_only block, we simply print (p) the current line and read the next (n) one until the input ends:

:print_only;
p;
n;
b print_only;

Notably, the advantage of this approach is that we’re keeping a single line in the pattern space, so it doesn’t use much memory.
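
As a more compact alternative, GNU sed supports the 0,/regex/ address form, which covers every line up to and including the first one that matches. The sketch below isn’t the script from above but a shorter variant under that GNU-only assumption, and the sample input is our own:

```shell
# After tac, the first line containing a comma is the last such line of the file.
# 0,/,/ limits the substitution to the lines up to and including that line.
result=$(printf 'a,b\nc,d\nno comma here\n' \
    | tac \
    | sed -E '0,/,/ s/(.*),(.*)/\1\2/' \
    | tac)
printf '%s\n' "$result"
```

Like the label-based script, this variant holds only one line in the pattern space at a time.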

4. Using awk

Let’s learn how we can use the awk utility to remove the last occurrence of a comma in the items.txt file.

4.1. Using Buffer Array

Let’s start by looking at the remove_comma.awk script in its entirety:

$ cat remove_comma.awk
function sub_at_position(line, position) {
    len = length(line);
    pre = substr(line, 1, position-1);
    post = substr(line, position+1, len-position);
    return pre post;
}
{
    buffer[NR] = $0;
    n = split($0, a, ",");
    if (n > 1) {
        last_occurrence = NR;
        position_last_comma = length($0) - length(a[n]);
    }
}
END {
    for (i = 1; i <= NR; i++) {
        if (i == last_occurrence) {
            buffer[i]=sub_at_position(buffer[i], position_last_comma);
        }
        print buffer[i];
    }
}

Now, we can go step by step to understand the code flow within the script:

  1. define the helper sub_at_position() function that accepts two positional parameters, line and position, and splits the line into pre and post as text falling before and after the position
  2. store each line in the buffer array
  3. keep track of the last line number containing a comma with the last_occurrence variable
  4. define the position_last_comma variable to store the last comma position for this line
  5. print each line from the buffer array in the END block

Only for the last_occurrence line, we use the sub_at_position() function to remove the comma marked by the position_last_comma index.
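
To make the position arithmetic concrete, here’s a small worked sketch on the hypothetical line a,b,c. It repeats the split()-based calculation from the script and then removes the character at that position with substr():

```shell
# For the line a,b,c, split() yields 3 fields and the last field is "c",
# so the last comma sits at position length("a,b,c") - length("c") = 4
position=$(echo 'a,b,c' | awk '{ n = split($0, a, ","); print length($0) - length(a[n]) }')
echo "$position"

# Join the text before and after that position to drop the comma
trimmed=$(echo 'a,b,c' | awk '{
    n = split($0, a, ",")
    p = length($0) - length(a[n])
    print substr($0, 1, p - 1) substr($0, p + 1)
}')
echo "$trimmed"
```

The first command prints 4, and the second prints a,bc.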

Finally, let’s execute the remove_comma.awk script to remove the last occurrence of comma (,) in the items.txt file:

$ awk -f remove_comma.awk items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9

It looks like we nailed this one as well.

4.2. Using tac

Like the greedy approach with the sed utility, the buffer-based approach with awk utilizes a lot of memory. So, it’s not preferred for large datasets. However, we can again optimize the approach by using tac.

In this case, we reverse the items.txt file with tac, remove the comma from the first matching line with awk, and then reverse the output back:

$ tac items.txt | awk -f remove_comma_optimized.awk | tac

Now, let’s take a look at the remove_comma_optimized.awk script in its entirety:

$ cat remove_comma_optimized.awk
function sub_at_position(line, position) {
    len = length(line);
    pre = substr(line, 1, position-1);
    post = substr(line, position+1, len-position);
    return pre post;
}
BEGIN {
    is_done=0;
}
{
    if (!is_done) {
        n = split($0, a, ",");
        if (n > 1) {
            position_last_comma = length($0) - length(a[n]);
            $0=sub_at_position($0, position_last_comma);
            is_done=1;
        }
    }
    print $0
}

Next, we can understand the optimizations done in remove_comma_optimized.awk script over the remove_comma.awk script:

  • we reused the sub_at_position() function from the remove_comma.awk script
  • we no longer use the buffer array, so we’ve removed the END block
  • we defined the is_done variable in the BEGIN block to track the remove operation, so we can use this for performing a one-time removal operation

Lastly, let’s execute the remove_comma_optimized.awk script in combination with tac:

$ tac items.txt | awk -f remove_comma_optimized.awk | tac
item1,item2,item3,
item4,item5,item6,
item7,item8,item9

It works as expected and removes the last occurrence of a comma from the input file.
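
For quick one-off use, the same idea also fits into an inline awk program. The following is a sketch of an alternative rather than the script above: it relies on match() with the greedy regex .*, instead of split(), and the done variable and the sample input are our own:

```shell
# match() with the greedy regex .*, matches from the start of the line
# through its last comma, so RLENGTH equals that comma's position
result=$(printf 'a,b\nc,d\ne\n' \
    | tac \
    | awk '!done && match($0, /.*,/) {
          $0 = substr($0, 1, RLENGTH - 1) substr($0, RLENGTH + 1)
          done = 1
      }
      { print }' \
    | tac)
printf '%s\n' "$result"
```

Since the done flag is set after the first removal, every remaining line passes through unchanged.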

5. Using the Vim Editor

Vim is a versatile text editor that can be used for effective text manipulation. We can write a Vim script to remove the last occurrence of a pattern in a file.

5.1. Vim Script

We can automate text editing operations with a Vim script and run them repeatedly. So, let’s write a basic function in the remove_last_pattern.vim script file:

$ cat remove_last_pattern.vim
function! RemoveLastPattern(pattern)
    " Get the total number of lines in the file
    let l:last_line_num = line('$')

    " Move cursor to the end of the current line
    normal! $

    " Initialize a flag to track if pattern is found
    let l:pattern_found = 0

    " Get the current line where the cursor is positioned
    let l:line = getline('.')

    " Find the position of the last occurrence of the pattern in the current line
    let l:pos = strridx(l:line, a:pattern)

    " Search for the last occurrence in the entire file
    for l:lnum in reverse(range(1, l:last_line_num))
        let l:line = getline(l:lnum)
        let l:pos = strridx(l:line, a:pattern)
        if l:pos != -1
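            " Join the text before and after the match; strpart() also
            " handles a match at the very start of the line (l:pos == 0)
            let l:line = strpart(l:line, 0, l:pos) . strpart(l:line, l:pos + len(a:pattern))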
            call setline(l:lnum, l:line)
            let l:pattern_found = 1
            break
        endif
    endfor
endfunction

" Map the function to a command for ease of use
command! -nargs=1 RemoveLast :call RemoveLastPattern(<f-args>)

Initially, the script can look overwhelming. However, it’s just a series of vim commands. Let’s look closer to understand the complete logic and each action within the RemoveLastPattern() function:

  1. get the total number of lines available in the file
  2. move the cursor to the end of the current line using the $ command in normal mode
  3. initialize a few variables to track whether the pattern was found, the text of the current line, and the position of the last occurrence of the pattern in that line
  4. initiate a loop to search for a pattern in reverse order

A string function, strridx(), finds the index of the last occurrence of the pattern in the current line. It treats the pattern as a literal string rather than a regular expression and returns -1 if a match isn’t found. If a match is found, we remove the pattern from the current line and use the break command to end the loop iterations.

Lastly, we create a custom command, RemoveLast, that calls the RemoveLastPattern() function with exactly one argument (-nargs=1). Notably, <f-args> gets replaced by the argument given to RemoveLast, which is then passed to the RemoveLastPattern() function.

5.2. Vim Script in Action

Let’s open the items.txt file using the vim command:

$ vim items.txt

Now, we source the remove_last_pattern.vim script so that we get access to the RemoveLast custom command:

:source remove_last_pattern.vim

Next, we can call the RemoveLast command with a comma (,) as the first argument:

:RemoveLast ,

At this point, we’ve successfully removed the last occurrence of a comma (,) in the buffer that holds the items.txt file.

Finally, after verifying the changes, we can choose to save the file:

:wq

Thus, we have a convenient way to implement and apply the use case in Vim.

6. Conclusion

In this article, we learned how to remove the last occurrence of a pattern in a file.

In particular, we explored command-line utilities such as sed, awk, and tac, as well as a Vim script. The choice between these options depends on the size of the file and the preference of the user.