1. Overview
Since Linux is heavily text-based, text conversion is a fairly common task on the command line.
In this tutorial, we’ll talk about custom text conversion within a command-line environment. Specifically, we’ll convert to the JavaScript Object Notation (JSON) format.
The text source for such conversions can be anything that looks like a list of JSON name-value pairs. For example, from a sensor or from an API that returns the status of a device, we can get text made of name-value pairs grouped under bracketed headers:
[0,0,0]0
prop=1
interval=10
action=0
[1,0,0]1
address=AA:BB:CC:DD:11:22
path=/usr
[2,0,0]1
address=22:11:DD:CC:BB:AA
path=/bin
Consequently, we would like to convert this input into a series of valid JSON objects:
{"prop" : "1",
"interval" : "10",
"action" : "0"}
{"address" : "AA:BB:CC:DD:11:22",
"path" : "/usr"}
{"address" : "22:11:DD:CC:BB:AA",
"path" : "/bin"}
Notably, we’ll use a smaller section of this input text and perform the conversion with different methods and tools. The procedures are specific to the input shown, but we can easily adapt them to other inputs.
2. Using sed
sed is one of the first tools that come to mind for text editing on the command line. That’s because it can perform anything from very simple to rather complex operations on text files. For instance, with a basic match and substitute pattern, we can search and replace strings anywhere in the text.
Two sed flags are especially useful when dealing with complex inputs:
- -E enables the use of extended regular expressions in addition to basic regular expressions to match patterns
- -z treats the input as lines terminated by a NUL character instead of a newline character
The former provides more flexibility with our search patterns, while the latter enables us to perform operations around the end of a line and the beginning of the next one.
We’ll go step by step, building up our sed command to understand each one of the individual blocks.
2.1. Getting JSON Separators
The first step is to get the curly brackets that JSON uses as separators. We can do that with a simple search and replace pattern:
$ sed 's/^\[.*/}{/g' input.txt
}{
prop=1
interval=10
action=0
}{
address=AA:BB:CC:DD:11:22
path=/usr
We match the whole line (.*) that starts (^) with an opening square bracket (\[) and replace it with }{.
However, if we’re familiar with extended regular expressions, we can use them to match more complex patterns. With extended regular expressions, we just need to escape the curly brackets with a backslash (\):
$ sed -E 's/\[([0-9],)*[0-9]\][0-9]+/\}\{/g' input.txt
}{
prop=1
interval=10
action=0
}{
address=AA:BB:CC:DD:11:22
path=/usr
We’ve searched for a block of comma-separated digits between square brackets (\[([0-9],)*[0-9]\]) followed by one or more digits ([0-9]+). Each match has been replaced with }{ (written as \}\{).
Comparing the last two snippets, we see that there are different operations we can ask sed to perform in order to get the same result. However, we’ll continue by using extended regular expressions since they tend to be more powerful.
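To see what the stricter pattern buys us, here is a minimal Python sketch that applies the same regular expression with the re module. The sample string and the [draft] line are hypothetical test data, not part of the original input, and LOOSE is only a rough stand-in for the simpler ^\[.* pattern from the first command:

```python
import re

# Same pattern as the sed -E command: comma-separated single digits inside
# square brackets, followed by one or more digits.
HEADER = re.compile(r"\[([0-9],)*[0-9]\][0-9]+")
# Rough equivalent of the simpler sed pattern: any line starting with "[".
LOOSE = re.compile(r"^\[.*", re.MULTILINE)

sample = "[0,0,0]0\nprop=1\ninterval=10\n[1,0,0]1\npath=/usr\n"

# Replace every header with the JSON separators, as sed does with the g flag.
converted = HEADER.sub("}{", sample)
print(converted, end="")

# Hypothetical line that starts with a bracket but is not a header.
tricky = "[draft] not a header"
print(bool(HEADER.search(tricky)), bool(LOOSE.search(tricky)))
```

On the sample, both patterns produce the same text. They differ only on lines such as the hypothetical one, which the stricter pattern leaves alone and the looser one would replace.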
2.2. Converting to JSON Name-Value Pairs
Let’s now work on the content of the JSON file:
$ sed -E 's/\[([0-9],)*[0-9]\][0-9]+/\}\{/g' input.txt | sed -E 's/([^ ]+)=([^ ]+)/"\1" : "\2",/g'
}{
"prop" : "1",
"interval" : "10",
"action" : "0",
}{
"address" : "AA:BB:CC:DD:11:22",
"path" : "/usr",
With ([^ ]+), we capture a run of one or more non-space characters as a group that we can later reference with the \X construct, where X is the number of the group.
Thus, we get two items: one before the equal sign and one after. We surround those items with quotes, put a colon in between, and end with a comma. This is starting to look more like a JSON format.
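The capture-and-reference mechanism works the same way in other regex engines. As an illustration, this short Python sketch applies the same pattern and replacement to single lines. The function name quote_pair is our own, and the a=b=c input is a made-up edge case:

```python
import re

# Two capture groups of non-space characters around an equals sign.
PAIR = re.compile(r"([^ ]+)=([^ ]+)")

def quote_pair(line: str) -> str:
    """Rewrite name=value as "name" : "value", using backreferences."""
    return PAIR.sub(r'"\1" : "\2",', line)

print(quote_pair("address=AA:BB:CC:DD:11:22"))
# Made-up edge case: the first group is greedy, so it keeps everything
# up to the last equals sign.
print(quote_pair("a=b=c"))
```

In this sketch, the greedy first group means a line with several equals signs is split at the last one, which is worth keeping in mind if values may contain that character.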
2.3. Touch-Ups
The first and last lines are still not compliant with the format, so we need to work on them:
$ sed -E 's/\[([0-9],)*[0-9]\][0-9]+/\}\{/g' input.txt | sed -E 's/([^ ]+)=([^ ]+)/"\1" : "\2",/g' \
| sed -E '1s/.*/\{/' | sed -E '$s/.*/&\}/g'
{
"prop" : "1",
"interval" : "10",
"action" : "0",
}{
"address" : "AA:BB:CC:DD:11:22",
"path" : "/usr",}
With the first of the two new sed calls, we replace everything (.*) in the first line (1s) with an opening curly bracket. With the second one, we append a closing curly bracket (\}) to the contents (&) of the last line ($s).
We could further improve the format by grouping the curly brackets with the content and removing some commas:
$ sed -E 's/\[([0-9],)*[0-9]\][0-9]+/\}\{/g' input.txt | sed -E 's/([^ ]+)=([^ ]+)/"\1" : "\2",/g' \
| sed -E '1s/.*/\{/' | sed -E '$s/.*/&\}/g' | sed -Ez 's/,[\n]*\}/\}\n/g' | sed -Ez 's/\{[\n]*/\{/g'
{"prop" : "1",
"interval" : "10",
"action" : "0"}
{"address" : "AA:BB:CC:DD:11:22",
"path" : "/usr"}
These two operations match across line boundaries, so we need the -z flag. First, we replace a comma followed by any number of newlines and a closing curly bracket (,[\n]*\}) with a closing curly bracket and a newline (\}\n). After that, we replace each opening curly bracket followed by any number of newlines (\{[\n]*) with just an opening curly bracket (\{).
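The reason these substitutions need the whole input at once becomes clearer when we run them on a single string. The following Python sketch mirrors the two expressions. The function name tidy and the shortened sample text are assumptions made for illustration:

```python
import re

def tidy(text: str) -> str:
    """Apply the two multi-line clean-ups to the whole text at once."""
    # Drop the comma (and any newlines) before a closing bracket, then
    # end the object with a newline.
    text = re.sub(r",\n*\}", "}\n", text)
    # Pull the first pair up onto the same line as the opening bracket.
    return re.sub(r"\{\n*", "{", text)

# Shortened intermediate output, as produced by the earlier steps.
before = '{\n"prop" : "1",\n"action" : "0",\n}{\n"path" : "/usr",}\n'
after = tidy(before)
print(after, end="")
```

Because the newline is an ordinary character inside the string, a pattern can span the end of one line and the start of the next, which is the effect -z gives us in sed.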
2.4. sed Command Chaining
Finally, we can chain the sed commands with semicolons and redirect the output to a file. We can only chain commands that share the same flags, so the -z substitutions stay in a separate sed call:
$ sed -E 's/\[([0-9],)*[0-9]\][0-9]+/\}\{/g; s/([^ ]+)=([^ ]+)/"\1" : "\2",/g; 1s/.*/\{/; $s/.*/&\}/g' input.txt \
| sed -Ez 's/,[\n]*\}/\}\n/g; s/\{[\n]*/\{/g' > output.txt
With this command, we can convert our input text to JSON format. Similarly, we can augment the processing according to the specific needs of the input format.
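Since the result is a series of JSON objects rather than a single document, a quick sanity check can be useful. Below is a hedged Python sketch that walks through such text with json.JSONDecoder.raw_decode and fails on the first invalid object. The function name parse_concatenated is our own, and the sample string stands in for the contents of output.txt:

```python
import json

def parse_concatenated(text: str) -> list:
    """Parse JSON objects written one after another, separated by whitespace."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # raw_decode does not skip leading whitespace itself
            continue
        # Returns the decoded object and the index where it ended;
        # raises json.JSONDecodeError if the text is not valid JSON.
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects

# Stand-in for the contents of output.txt.
sample = (
    '{"prop" : "1",\n"interval" : "10",\n"action" : "0"}\n'
    '{"address" : "AA:BB:CC:DD:11:22",\n"path" : "/usr"}\n\n'
)
parsed = parse_concatenated(sample)
print(len(parsed), "objects parsed")
```

To check a real file, we could read it with open("output.txt") and pass its contents to the function. A leftover trailing comma would surface as a json.JSONDecodeError.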
3. Using Scripting Languages
We can also use programming languages to manipulate text input. This might seem like overkill for a scenario such as this one, but the option is a good general solution.
Specifically, we’ll cover the transformation using Python and Perl.
3.1. Python
We can transform the input text to JSON and save it in an output file using Python:
$ python -c "with open('input.txt', 'r') as input, open('output.txt', 'w') as output:\
[output.write('{\n'\
if line.startswith('[0')\
else '} {\n'\
if line.startswith('[')\
else '\"' + line.split('=')[0]+'\" : \"'+line.strip().split('=')[-1]+'\",\n')\
for line in input.readlines()];\
output.write('}\n')"
We call python with the -c flag to pass Python code from the CLI within quotes. Although we’ve added a backslash at the end of each line and indented the code with spaces for readability, the shell joins the lines, so the code is effectively a one-liner.
Let’s go through the code:
- open two files, input.txt as input for reading (‘r’) and output.txt as output for writing (‘w’)
- loop through all the lines of input and write to output at the same time based on the following rules
- if a line starts with [0, we write {\n to output
- if a line starts with just [, we write } {\n to output
- for all other cases, we write the name (the part before the equals sign) in double quotes, add a colon, and then write the value (the part after the equals sign) in double quotes, followed by a comma and a newline
Finally, we write the closing curly bracket and the customary newline at the end of the output.
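For anything longer than a one-liner, a small script is usually easier to read and maintain. The sketch below is one possible alternative, not the method above: it collects each block into a dictionary and lets the json module handle quoting and commas. The function name convert is our own, and splitting at the first equals sign with partition is an assumption about the input:

```python
import json

def convert(lines):
    """Group name=value lines into one dict per bracketed header."""
    records, current = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("["):
            # A header starts a new record.
            current = {}
            records.append(current)
        elif current is not None and "=" in line:
            # Assumption: split at the first equals sign only.
            name, _, value = line.partition("=")
            current[name] = value
    return records

sample = """[0,0,0]0
prop=1
interval=10
action=0
[1,0,0]1
address=AA:BB:CC:DD:11:22
path=/usr
"""
records = convert(sample.splitlines())
for record in records:
    print(json.dumps(record))
```

Since json.dumps serializes complete objects, this approach can't leave a dangling comma before a closing bracket, at the cost of a slightly different spacing than the hand-built strings.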
3.2. Perl
In Perl, we can use a similar script:
$ perl -l -n -e 'BEGIN{*1 = sub {print q<{>, join(",\n", splice @A, 0, @A), q<}>}}
next if $. == 1;
/^\[(?:\d+,?)+\]\d+$/ and &1,next;
push @A, join q/ : /, map qq/"$_"/, split /=/;
eof && &1;
' input.txt > output.txt
We call Perl with three flags that condition the script we provide:
- -l strips the line ending from each input line and adds it back on every print
- -n wraps the provided code in an implicit loop over the input lines, similar to a while loop, without printing them automatically
- -e precedes a string that contains our Perl code
We start with a BEGIN block to define a subroutine using the typeglob *1. This subroutine prints the elements of the array @A enclosed in curly brackets and separated by commas and newlines.
The next part of the code skips processing the first input line.
Then, we check for the (^\[(?:\d+,?)+\]\d+$) pattern, i.e., a line starting with a sequence of numbers enclosed in square brackets followed by another number. If it matches, we call the subroutine and continue to the next iteration.
If the pattern doesn’t match, we add to the @A array. Specifically, we split the line at the equals sign and map every element to a string in double quotes. After that, we join the elements with a colon surrounded by spaces before pushing the result onto the array.
Finally, when we reach the end of the file, we again call the subroutine to print the elements.
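The buffer-and-flush logic of the Perl script may be easier to follow when spelled out. Here is a rough Python rendering of the same steps, offered as a reading aid rather than an exact translation. The names flush and blocks are our own:

```python
import re

# Same header pattern as in the Perl script.
HEADER = re.compile(r"^\[(?:\d+,?)+\]\d+$")

def flush(pairs):
    """Like the Perl subroutine: wrap the buffered pairs in curly brackets."""
    return "{" + ",\n".join(pairs) + "}"

def blocks(lines):
    """Buffer quoted pairs and emit one block per header and at end of input."""
    out, pairs = [], []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if number == 1:
            continue  # skip the first header, like `next if $. == 1`
        if HEADER.match(line):
            out.append(flush(pairs))  # a new header closes the previous block
            pairs = []
            continue
        # Quote each side of the equals sign and join with " : ".
        pairs.append(" : ".join(f'"{part}"' for part in line.split("=")))
    out.append(flush(pairs))  # end of input closes the last block
    return out

sample = [
    "[0,0,0]0", "prop=1", "interval=10", "action=0",
    "[1,0,0]1", "address=AA:BB:CC:DD:11:22", "path=/usr",
]
result = blocks(sample)
print("\n".join(result))
```

Joining the buffered pairs only when a block is complete is what keeps a trailing comma from appearing before the closing bracket.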
4. Using the miller Dedicated Tool
There are also tools designed to simplify the processing and conversion of formatted text.
For example, we can use miller to convert text to JSON automatically. In this specific case, we still need some preprocessing:
$ sed 's/^\[.*//g' input.txt

prop=1
interval=10
action=0

address=AA:BB:CC:DD:11:22
path=/usr
The substitution empties each header line, so blank lines now separate the blocks. From here, miller takes care of the rest of the conversion:
$ sed 's/^\[.*//g' input.txt | mlr --ojson --irs lflf --ifs lf cat
[
{
"prop": 1,
"interval": 10,
"action": 0
},
{
"address": "AA:BB:CC:DD:11:22",
"path": "/usr"
}
]
We use flags to format the output data as JSON (--ojson). Furthermore, we specify that the input record separator (--irs) is a double newline using the lflf alias, where lf stands for the line feed character. Finally, we indicate that the input field separator (--ifs) is a single newline (lf).
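To make the role of the two separators concrete, here is a simplified Python sketch of the same idea: split the text into records on blank lines, split each record into fields on newlines, and split each field into a name and a value. This is not how miller is implemented. The parameter names irs, ifs, and ips merely echo its flag names, and the integer-only type inference is an assumption made to mimic the unquoted numbers in the output above:

```python
import json

def infer(value: str):
    """Assumption: treat values that parse as integers as numbers."""
    try:
        return int(value)
    except ValueError:
        return value

def parse_records(text, irs="\n\n", ifs="\n", ips="="):
    """Split text into records, fields, and name-value pairs."""
    records = []
    for chunk in text.split(irs):
        record = {}
        for field in chunk.split(ifs):
            if ips in field:
                name, _, value = field.partition(ips)
                record[name] = infer(value)
        if record:  # ignore chunks without any pairs
            records.append(record)
    return records

# Text as left by the sed preprocessing: headers replaced by empty lines.
preprocessed = (
    "\nprop=1\ninterval=10\naction=0\n"
    "\naddress=AA:BB:CC:DD:11:22\npath=/usr\n"
)
records = parse_records(preprocessed)
print(json.dumps(records, indent=2))
```

Changing the separators is then a matter of passing different arguments, which is roughly what the --irs and --ifs flags let us do without writing any code.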
Finally, we can save the standard output into the file as we did previously:
$ sed 's/^\[.*//g' input.txt | mlr --ojson --irs lflf --ifs lf cat > output.txt
This tool reduces the work and time spent chaining sed commands or scripting in a programming language.
5. Conclusion
In this article, we talked about methods to convert custom text to JSON.
First, we discussed sed with two approaches: chaining several simpler substitutions and using extended regular expressions. Then, we saw how to use a programming language such as Python or Perl. This might be overkill for some scenarios, but it allows the most complex conversions.
Finally, we explored the miller utility that we can use to simplify this conversion. However, this comes at the expense of customization for our specific case.