1. Overview
Extracting the base filename from a file path or a URL is a common task in Linux shell programs.
In this quick tutorial, we’ll explore two methods to find the base filename of a URL in Linux.
2. Parameter Expansion
In Linux shell programming, a parameter is an entity that stores values. It may be referenced by name, number, or one of several special characters.
Meanwhile, a variable is a parameter that’s referenced by name. For example, we can use a variable to store the name of a file:
$ FILENAME="filename.txt"
Parameter expansion is the substitution of a reference with its value. To expand a variable, for instance, we use the $ prefix:
$ echo ${FILENAME}
filename.txt
The braces around the variable name are optional in this simple case. However, they allow us to take advantage of other operators to solve more complex problems.
For instance, the # operator removes a substring matching a given pattern from the front of a variable value:
$ VAR1="apple pie"
$ echo ${VAR1#*p}
ple pie
A single # operator removes the shortest prefix matching the given pattern, which is *p in our case. Here, the asterisk is a wildcard to indicate zero or more characters before the p. In this case, the # operator removes the first prefix substring ending in the letter p, which is ap.
By contrast, a double ## operator removes the longest prefix matching the given pattern, or the longest substring ending in the letter p in this case:
$ VAR1="apple pie"
$ echo ${VAR1##*p}
ie
Likewise, the % and %% operators remove suffixes matching a given pattern. For instance, we can use the %% operator to strip the largest suffix that begins with a p from our example:
$ VAR1="apple pie"
$ echo ${VAR1%%p*}
a
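To compare the four operators side by side, here is a minimal sketch that applies each of them to the same value. The path variable and its contents are made up for illustration:

```shell
# Hypothetical sample value, chosen only to illustrate the operators
path="/home/user/archive.tar.gz"

echo "${path#*/}"     # shortest prefix ending in /  -> home/user/archive.tar.gz
echo "${path##*/}"    # longest prefix ending in /   -> archive.tar.gz
echo "${path%.*}"     # shortest suffix starting at . -> /home/user/archive.tar
echo "${path%%.*}"    # longest suffix starting at .  -> /home/user/archive
```

Quoting the expansions keeps the shell from splitting or globbing the result, which matters once values contain spaces or wildcard characters.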
These operators can be useful when it comes to URL data extraction.
3. Parameter Expansion With URLs
We can now use parameter expansion operators to find the base filename from a given URL.
First, it’s helpful to understand that a URL is a type of URI. URLs are composed of several parts:
scheme:[//authority][/path][?query][#fragment]
For URLs, scheme is the name of the access protocol. Examples include http or https (and there are many others).
The authority element often consists of a hostname or IP address (and optional port). The path specifies a resource in the scope of its scheme and authority.
The query and fragment suffixes are optional. If they are present, though, they must be ordered as we see above for a URL to be well-formed.
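To make the structure concrete, the following sketch splits a URL into these parts using only parameter expansion. The split_url helper is hypothetical, and it assumes a hierarchical URL that contains ://, so other URI forms aren't handled:

```shell
# Hypothetical helper: split scheme://authority/path?query#fragment
# into its components. Assumes the URL contains "://".
# The body runs in a subshell, so its variables don't leak to the caller.
split_url() (
    url=$1
    scheme=${url%%://*}          # everything before the first ://
    rest=${url#*://}             # everything after it

    # The fragment is whatever follows the first #
    fragment=
    case $rest in
        *'#'*) fragment=${rest#*'#'}; rest=${rest%%'#'*} ;;
    esac

    # The query is whatever follows the first ? in what remains
    query=
    case $rest in
        *'?'*) query=${rest#*'?'}; rest=${rest%%'?'*} ;;
    esac

    # The authority ends at the first slash; the path starts there
    authority=${rest%%/*}
    path=
    case $rest in
        */*) path=/${rest#*/} ;;
    esac

    printf 'scheme=%s\nauthority=%s\npath=%s\nquery=%s\nfragment=%s\n' \
        "$scheme" "$authority" "$path" "$query" "$fragment"
)

split_url "https://example.com:8080/dir/file.html?par1=value#frag"
```

For the sample URL, the helper prints https, example.com:8080, /dir/file.html, par1=value, and frag, each on its own labeled line.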
Now, we can use the ## operator with a forward slash (/) pattern to find the base filename from a URL:
$ URL="http://example.com/dir/file.html"
$ echo ${URL##*/}
file.html
On the other hand, we can use the %% operator to remove the suffix from a more complex URL that contains a query, a fragment, or both:
$ URL="http://example.com/dir/file.html?par1=value#frag"
$ echo ${URL%%[?#]*}
http://example.com/dir/file.html
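In this example, the bracket expression [?#] matches either a literal ? or a literal #, and the trailing asterisk matches everything after it. The %% operator then removes the longest matching suffix, which starts at the first ? or # in the URL.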
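The choice between % and %% matters when a URL carries both a query and a fragment. As a short sketch using the same sample URL:

```shell
URL="http://example.com/dir/file.html?par1=value#frag"

# Shortest match: only the fragment is removed
echo "${URL%[?#]*}"     # http://example.com/dir/file.html?par1=value

# Longest match: the query and the fragment are both removed
echo "${URL%%[?#]*}"    # http://example.com/dir/file.html
```

With a single %, the pattern matches from the last ? or # onward, so the query survives.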
Now, we can combine ## and %% to find the base filename in a URL that carries a query, a fragment, or both:
$ URL="http://example.com/dir/file.html?par1=value#frag"
$ fileAndSuffix=${URL##*/}
$ echo ${fileAndSuffix%%[?#]*}
file.html
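The fileAndSuffix variable holds the original URL, but with the prefix removed. The parameter expansion in the echo command then removes the query and fragment suffixes. Note that this order relies on the query and fragment containing no slash, since ##*/ cuts at the last slash in the whole string.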
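A query or fragment may itself contain a slash, as in ?next=/home. The following sketch, built around a hypothetical url_basename helper, strips the suffix before the prefix so that such a slash is never treated as a path separator:

```shell
# Hypothetical helper: print the base filename of a URL.
# The body runs in a subshell, so no_suffix doesn't leak to the caller.
url_basename() (
    # Drop the query and fragment first ...
    no_suffix=${1%%[?#]*}
    # ... then drop everything up to the last remaining slash
    printf '%s\n' "${no_suffix##*/}"
)

url_basename "http://example.com/dir/file.html?par1=value#frag"   # file.html
url_basename "http://example.com/dir/file.html?next=/home"        # file.html
```

A URL whose path ends in a slash, such as http://example.com/dir/, yields an empty string with this approach, so callers may want to check for that case.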
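The #, ##, %, and %% operators are defined by POSIX, so they work in Bash, Dash, Zsh, and other POSIX-compatible shells. The GNU Bash Reference Manual documents the full set of parameter expansion operators that Bash supports.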
4. The basename Command
Another option for finding the base filename in a simple URL is the basename command, which is part of the GNU Coreutils package:
$ URL="http://example.com/dir/file.html"
$ basename "$URL"
file.html
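The basename command treats the URL as an ordinary path and strips everything up to the last slash. However, it knows nothing about query or fragment suffixes and leaves them in place.
In other words, it won’t work for our more complex URL: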
$ URL="http://example.com/path/to/page.html?par1=value&par2=value#frag1"
$ basename "$URL"
page.html?par1=value&par2=value#frag1
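The basename command is specified by POSIX, so it’s available on virtually every Linux distribution. Most ship the GNU Coreutils implementation, while some, such as Alpine Linux, provide it through BusyBox instead.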
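The two approaches can also be combined. As a minimal sketch, we can let parameter expansion strip the query and fragment, then hand the result to basename:

```shell
URL="http://example.com/path/to/page.html?par1=value&par2=value#frag1"

# Strip the query and fragment, then let basename drop the leading path
basename "${URL%%[?#]*}"          # page.html

# basename also accepts an optional fixed suffix to remove
basename "${URL%%[?#]*}" .html    # page
```

The optional second argument removes only an exact trailing string, not a pattern, so it suits cases where the file extension is known in advance.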
5. Conclusion
In this article, we explored two methods to extract the base filename from a URL in a Linux shell.
First, we learned ways to use parameter expansion to trim prefixes and suffixes from a URL. Then, we saw how the basename command can achieve the same goal for simple cases. Both are common Linux tools.
While basename works well for simple URLs, it fails for more complex cases. Parameter expansion, on the other hand, is a powerful tool for solving problems beyond just filename extraction.