1. Overview
In computing, not every string can be stored or transmitted as is, since some characters are hard to store or carry a special meaning for the program that interprets them. Because of this, there are many ways to encode a string using only a specific set of characters.
In this tutorial, we’ll learn what URL encoding is. Then, we’ll go through a few different methods for decoding encoded URLs in Linux.
2. Understanding URL Encoding
URL encoding and decoding are standardized through RFC 3986. In short, some characters in a URL are represented using percent-encoding – a percent sign followed by two hexadecimal digits.
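Percent-encoding distinguishes two main sets of characters:
Type                     Characters
Reserved characters      ! * ' ( ) ; : @ & = + $ , / ? # [ ]
Unreserved characters    a-z A-Z 0-9 - _ . ~
Reserved characters serve a special purpose as delimiters, so they should be percent-encoded when they appear as literal data in a URL. The percent sign itself is always encoded as %25, and characters outside both sets, such as { } < > and spaces, should be encoded as well. Unreserved characters never need encoding, but can be represented using percent-encoding too.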
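Lastly, spaces are often represented as plus signs, a convention that comes from HTML form data (application/x-www-form-urlencoded), while their plain percent-encoded form is %20: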
Encoded: example.org/5+Percent+Codes+%21%2A%27%28%2B
Decoded: example.org/5 Percent Codes !*'(+
As we can see in this example, we have an encoded string containing five percent codes and three pluses. The pluses get turned into spaces and the percent codes get decoded to ASCII.
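To double-check this kind of example, here’s a minimal sketch that assumes Python 3 is available and uses its standard urllib.parse module. The variable names are ours and purely illustrative:

```python
from urllib.parse import quote_plus, unquote_plus

# The decoded path from the example above
path = "5 Percent Codes !*'(+"

# quote_plus() turns spaces into pluses and other special characters into %XX
encoded = "example.org/" + quote_plus(path)
print(encoded)   # example.org/5+Percent+Codes+%21%2A%27%28%2B

# unquote_plus() reverses both replacements
decoded = unquote_plus(encoded)
print(decoded)   # example.org/5 Percent Codes !*'(+

# Count the percent codes and pluses in the encoded form
print(encoded.count("%"), encoded.count("+"))   # 5 3
```

Encoding and then decoding should always give back the original string, which makes a round trip like this a handy sanity check.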
3. Encoding URLs
To encode a URL, we follow two steps:
- replace any spaces with a plus sign or %20
- replace all reserved and special characters with their percent-encoded equivalent
We can perform these actions in many ways.
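To make the two steps concrete before looking at ready-made tools, here’s a minimal sketch in Python 3. The percent_encode() function and its plus_for_space parameter are hypothetical names chosen for illustration, and we assume the input should be treated as UTF-8:

```python
import string

# Unreserved characters are left untouched
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")

def percent_encode(text: str, plus_for_space: bool = False) -> str:
    """Encode text byte by byte (illustrative helper, not a library API)."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)               # keep unreserved characters as is
        elif ch == " " and plus_for_space:
            out.append("+")              # step 1: space as a plus sign
        else:
            out.append(f"%{byte:02X}")   # step 2: % plus two hex digits
    return "".join(out)

print(percent_encode("example.org/E = mc^2"))
# example.org%2FE%20%3D%20mc%5E2
print(percent_encode("E = mc^2", plus_for_space=True))
# E+%3D+mc%5E2
```

Working on bytes rather than characters means a non-ASCII character such as é simply becomes several percent codes (%C3%A9).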
3.1. Using the Shell
Let’s create a Bash script that takes a parameter and encodes it:
$ cat percentencode.sh
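#!/bin/bash
# Percent-encode the first argument, one character at a time
len="${#1}"
for ((n = 0; n < len; n++)); do
    c="${1:n:1}"
    case $c in
        # unreserved characters are printed unchanged
        [a-zA-Z0-9.~_-]) printf '%s' "$c" ;;
        # everything else becomes % plus its two-digit hex code
        *) printf '%%%02X' "'$c" ;;
    esac
done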
In particular, we first get the string [len]gth of the first (1) command-line argument and use that to loop over each character within it. For each, we check whether it’s an unreserved character via the [a-zA-Z0-9.~_-] pattern. If so, we print it as is. If not, we use the printf command to print a percent-encoded version of the character: the leading quote in "'$c" makes printf use the character’s numeric code, which %02X formats as two hexadecimal digits.
Let’s test the result:
$ chmod +x percentencode.sh
$ ./percentencode.sh 'example.org/E = mc^2'
example.org%2FE%20%3D%20mc%5E2
Although spaces can also be encoded as plus signs, this implementation uses %20.
3.2. Using perl and python
Let’s see a perl solution for percent-encoding:
$ perl -MURI::Escape -e 'print uri_escape($ARGV[0])' 'http://example.com/E = mc^2'
http%3A%2F%2Fexample.com%2FE%20%3D%20mc%5E2
Here, we use the URI::Escape module and its uri_escape() function to convert the string. Notably, even / forward slashes are encoded.
When using python, we can employ the urllib module to both encode and decode. Let’s do the former, starting with the legacy Python 2 syntax:
$ python2 -c 'import urllib, sys; print urllib.quote(sys.argv[1])' 'http://example.com/E = mc^2'
http%3A//example.com/E%20%3D%20mc%5E2
In this case, we can also use urllib.quote_plus() to have spaces converted to a + plus sign instead of %20.
If we require Python 3, we can use a more modern version of the same:
$ python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1]))' 'http://example.com/E = mc^2'
http%3A//example.com/E%20%3D%20mc%5E2
Again, urllib.parse.quote_plus() can be used instead to encode spaces as plus signs rather than %20.
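The difference between these functions is easiest to see side by side. The following sketch assumes Python 3 and uses only urllib.parse functions from the standard library, while the variable names are ours:

```python
from urllib.parse import quote, quote_plus

url = "http://example.com/E = mc^2"

# By default, quote() treats "/" as safe and leaves it alone
default_quote = quote(url)
print(default_quote)   # http%3A//example.com/E%20%3D%20mc%5E2

# An empty safe set encodes slashes too
strict_quote = quote(url, safe="")
print(strict_quote)    # http%3A%2F%2Fexample.com%2FE%20%3D%20mc%5E2

# quote_plus() encodes slashes and writes spaces as "+"
plus_quote = quote_plus(url)
print(plus_quote)      # http%3A%2F%2Fexample.com%2FE+%3D+mc%5E2
```

Passing safe="" should therefore reproduce the stricter output we saw from perl’s uri_escape().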
3.3. Using jq
Another workable, albeit non-typical, solution comes with the jq tool:
$ jq --slurp --raw-input --raw-output @uri <(printf 'http://example.com/E = mc^2')
http%3A%2F%2Fexample.com%2FE%20%3D%20mc%5E2
Here, we use the --raw-input (-R) flag together with --slurp (-s) to read the whole input as a single string instead of parsing it as JSON, and --raw-output (-r) to print the result without JSON quoting. The filter is @uri for encoding.
Notably, we leverage process substitution to pass the string as a file, although jq can also read it from standard input. Further, printf avoids the trailing newline from tools like echo.
3.4. Defining Aliases
As usual, we can define an alias with an encoding command:
$ alias pyncode='python3 -c "import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1]))"'
Once defined, we can use the alias for easier encoding:
$ pyncode 'http://example.com/E = mc^2'
http%3A//example.com/E%20%3D%20mc%5E2
This way, we don’t have to search for or retype long commands.
4. Decoding URLs
To decode a URL, we generally follow two steps:
- replace any plus signs with spaces
- remove the percent signs and convert the following two hexadecimal digits to ASCII
This way, we convert each encoded combination into a character.
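As a rough illustration of these two steps, here’s a minimal Python 3 sketch. The percent_decode() function and its plus_as_space parameter are hypothetical names, and we assume the encoded bytes represent UTF-8 text:

```python
import re

def percent_decode(text: str, plus_as_space: bool = True) -> str:
    """Decode %XX sequences (illustrative helper, not a library API)."""
    # Step 1: pluses become spaces; this must happen before %2B is decoded
    if plus_as_space:
        text = text.replace("+", " ")
    # Step 2: replace each % followed by two hex digits with that byte.
    # Working on bytes lets multi-byte UTF-8 sequences decode correctly.
    raw = re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.encode("utf-8"),
    )
    return raw.decode("utf-8", errors="replace")

print(percent_decode("example.org/end+sentence+.%3F%21"))
# example.org/end sentence .?!
print(percent_decode("1%2B1"))
# 1+1
```

The order of the steps matters: if we decoded %2B first, the resulting literal plus sign would then be wrongly turned into a space.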
4.1. Using the Shell
Let’s begin with a simple bash solution without the use of external programs:
$ (IFS="+"; read _z; echo -e ${_z//%/\\x}"") <<< 'example.org/end+sentence+.%3F%21'
example.org/end sentence .?!
The first part of this one-liner works through word splitting at the plus sign via IFS. To ensure that word splitting occurs, we don’t quote the variable, and echo then joins the resulting words with spaces. However, we put empty quotes after the variable so that a plus sign at the end of a URL doesn’t get cut off. Further, we use a subshell so that IFS isn’t changed globally.
To take input as a variable, we can use read. This enables us to use parameter expansion on this variable to replace all occurrences of percent signs with \x. Then, we use echo -e to interpret these escapes.
It’s a bit more difficult and less efficient, but we can make a portable, POSIX-compliant shell script to accomplish this as well:
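#!/bin/sh
posix_compliant() {
    strg="${*}"
    # print everything before the first % or +
    printf '%s' "${strg%%[%+]*}"
    # j starts at that % or +, while strg holds what follows it
    j="${strg#"${strg%%[%+]*}"}"
    strg="${j#?}"
    case "${j}" in
        "%"* )
            # convert the two hex digits to octal and print that byte
            printf '%b' "\\0$(printf '%o' "0x${strg%"${strg#??}"}")"
            strg="${strg#??}"
            ;;
        "+"* ) printf ' ' ;;
        * ) return ;;
    esac
    # recurse on the remainder of the string
    if [ -n "${strg}" ]; then posix_compliant "${strg}"; fi
}
posix_compliant "${*}"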
Here, we use recursion along with POSIX-supported parameter expansion to decode the same string. Notably, we convert each pair of hexadecimal digits to octal first, since POSIX printf doesn’t support \x hex escapes but does interpret \0 octal escapes in %b arguments.
After creating this shell script, we make it executable using chmod and then execute it by using its full path:
$ chmod +x decode.sh
$ /path/of/script/decode.sh 'example.org/a%26b%40c'
example.org/a&b@c
In this more complex variant, the script should work in almost any shell without an outside program.
4.2. Using perl and python
Depending on the environment and requirements, we might prefer to use a scripting language interpreter for the same task.
Creating a solution using perl is as simple as with a shell:
$ perl -pe 's/\+/ /g;' -e 's/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;' <<< 'example.org/%3C%2Fend%3E'
example.org/</end>
Here, we use the perl substitution operator to replace the plus signs in the string with spaces. Afterward, we substitute percent signs and the following two-digit hexadecimal with the ASCII equivalent. Specifically, the e modifier evaluates the expression chr(hex($1)) with hex converting to decimal and then chr converting to ASCII.
Finally, let’s create a solution using python:
$ python3 -c 'print(input().replace("+", " ").replace("%", "\\x").encode().decode("unicode_escape"))' <<< 'example.org/%7B1%2C2%7D'
example.org/{1,2}
This works the same way as the previous example except we wait to convert to ASCII until the very end. We replace the percent signs with a literal \x (written as "\\x" in the Python source) and then convert the string to bytes using the str.encode() method so that we can use bytes.decode() with the unicode_escape codec, which interprets each \x sequence as the corresponding character.
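Since the urllib module can decode as well as encode, a shorter alternative is worth sketching. The snippet below assumes Python 3 and uses the standard urllib.parse.unquote() and unquote_plus() functions, with illustrative variable names and sample strings of our own:

```python
from urllib.parse import unquote, unquote_plus

# unquote_plus() handles both plus signs and percent codes
braces = unquote_plus("example.org/%7B1%2C2%7D")
print(braces)   # example.org/{1,2}

# Percent codes are decoded as UTF-8 by default
coffee = unquote_plus("caf%C3%A9+au+lait")
print(coffee)   # café au lait

# unquote() decodes percent codes but leaves plus signs alone
plain = unquote("a+b%20c")
print(plain)    # a+b c

# For comparison, the unicode_escape approach treats each byte as a
# separate character, so multi-byte UTF-8 sequences come out garbled
naive = "caf%C3%A9".replace("%", "\\x").encode().decode("unicode_escape")
print(naive)    # cafÃ©
```

So, for URLs that may contain non-ASCII characters, unquote_plus() is likely the safer choice of the two.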
4.3. Defining Aliases
To enable easier URL conversion on the command line, we can define an alias in a custom ~/.bashrc:
alias decode_url='perl -pe '\''s/\+/ /g;'\'' -e '\''s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;'\'' <<< '
In this case, we convert the perl code to an alias by surrounding it with single quotes. To preserve the single quotes within the perl script, we replace each of them with '\'', which closes the quoted string, adds an escaped single quote, and reopens the string.
Now let’s test out the alias:
$ decode_url 'example.org/E+%3D+mc%5E2'
example.org/E = mc^2
Thus, we call the new alias from the terminal and it decodes the URL we pass to it.
5. Conclusion
In this article, we learned what URL encoding is and what purpose it serves. Then, we discussed a few ways to encode a regular URL and decode an encoded URL. Further, we defined aliases for each action.