1. Overview
Bash is well suited to automating system-level tasks in Linux, whereas Python offers rich libraries for more complex problems such as data analysis. By calling Python from within Bash, we can combine the strengths of both in a single automated workflow.
In this tutorial, we’ll explore how to call the Python interpreter from a Bash script and how to embed Python code within Bash.
2. Sample Task
Let’s suppose we have a comma-delimited (CSV) data file named db.csv. First, we check the contents of db.csv with cat:
$ cat ./db.csv
ID,NAME,OCCUPATION
1,Ron,Engineer
2,Lin,Engineer
3,Tom,Architect
4,Mat,Engineer
5,Ray,Botanist
6,Val,Architect
Our goal is to append a column of Boolean values to the data, based on the OCCUPATION field of each row. A row gets True if its occupation occurs with neither the minimum nor the maximum frequency across the whole file. Otherwise, the row gets False.
In this case, Botanist occurs with the minimum frequency (once) and Engineer with the maximum frequency (three times). Therefore, we should append True only to rows where the occupation is Architect, which occurs twice, and False to rows where the occupation is Botanist or Engineer.
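To make the selection rule concrete, here's a minimal sketch that applies it to the OCCUPATION values of the sample file using only the standard library. It's an illustration of the rule, not the pandas-based approach used later, and the variable names are our own:

```python
from collections import Counter

# Occupations from the sample db.csv, in row order.
occupations = ["Engineer", "Engineer", "Architect", "Engineer", "Botanist", "Architect"]

counts = Counter(occupations)  # Engineer: 3, Architect: 2, Botanist: 1
lowest, highest = min(counts.values()), max(counts.values())

# Keep occupations whose count is neither the minimum nor the maximum.
selected = {job for job, n in counts.items() if n not in (lowest, highest)}

# One Boolean per row: True if the row's occupation was selected.
flags = [job in selected for job in occupations]
print(selected)  # {'Architect'}
print(flags)     # [False, False, True, False, False, True]
```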
Let’s explore how we can call Python from a Bash script to handle the data processing of this task.
3. Using python3 -c
For smaller tasks, we can call Python via python3 -c as a one-liner. The -c option takes the Python code to run as a string argument.
For example, we can call Python to read and display our CSV as a data frame:
$ python3 -c 'import pandas as pd; df = pd.read_csv("./db.csv"); print(df.to_string(index=False))'
ID NAME OCCUPATION
1 Ron Engineer
2 Lin Engineer
3 Tom Architect
4 Mat Engineer
5 Ray Botanist
6 Val Architect
Here, we import the pandas module and read the CSV file with the pd.read_csv() function. Then, to display the data frame without the default index column, we convert it to a string with the index option set to False.
We can also pass arguments to the Python command:
$ path_to_file='./db.csv'
$ python3 -c 'import pandas as pd; import sys; df = pd.read_csv(sys.argv[1]); print(df.to_string(index=False))' "$path_to_file"
ID NAME OCCUPATION
1 Ron Engineer
2 Lin Engineer
3 Tom Architect
4 Mat Engineer
5 Ray Botanist
6 Val Architect
In this case, we pass the file path as an argument instead of hard-coding it within the command string. To do so, we import the sys module and read the argument from sys.argv[1]. When code is run with -c, sys.argv[0] is set to the string -c, so the first argument after the command string has index 1.
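To see why the index is 1, we can inspect the argument list that the one-liner receives. The following sketch launches a child interpreter with -c from Python itself, purely so the example is self-contained; sys.executable stands in for python3, and the variable names are our own:

```python
import subprocess
import sys

# Launch a child interpreter the way Bash would: python3 -c '<code>' <argument>.
# sys.executable stands in for python3 so the sketch runs on any installation.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv)", "./db.csv"],
    capture_output=True,
    text=True,
    check=True,
)

argv_seen_by_child = result.stdout.strip()
print(argv_seen_by_child)  # ['-c', './db.csv']
```

The first element is the literal string -c, and the file path follows at index 1.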
4. Using a Standalone Python Script
For larger tasks, such as our sample task, we can use a standalone Python script and call it from within Bash:
$ cat ./count_filter.py
import pandas as pd
import numpy as np
import sys
path_to_file = sys.argv[1]
df = pd.read_csv(path_to_file)
values, counts = np.unique(df['OCCUPATION'], return_counts=True)
x = [values[i] for i in range(len(values)) if counts[i] not in [min(counts), max(counts)]]
df['Selected'] = df['OCCUPATION'].isin(x)
df.to_csv('./result.csv', index=False, header=False)
The Python script, named count_filter.py, performs a number of steps:
- import the pandas, numpy, and sys modules
- define the file path as the first argument passed to the script
- load the CSV file into a data frame variable named df
- call np.unique() over the OCCUPATION column to return the occupation values and their counts
- use a list comprehension to select occupations that occur with neither the minimum nor maximum count
- append a column, named Selected, that is True where the row's occupation is among those selected in the previous step and False otherwise
- save the new data frame to a CSV file named result.csv, without the default index column or header
Next, we call the Python interpreter to execute the count_filter.py script while providing the file path as an argument:
$ python3 ./count_filter.py ./db.csv
This saves the result in a new file named result.csv:
$ cat ./result.csv
1,Ron,Engineer,False
2,Lin,Engineer,False
3,Tom,Architect,True
4,Mat,Engineer,False
5,Ray,Botanist,False
6,Val,Architect,True
We see that only the third and sixth rows get True appended since the Architect occupation occurs with neither the minimum nor maximum frequency.
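Since count_filter.py reads sys.argv[1] directly, running it without an argument ends in an IndexError traceback. One possible guard, sketched below, prints a usage message and exits with a non-zero status that a calling Bash script can check. The helper name get_csv_path and the exit status 2 are illustrative choices and not part of the script above:

```python
import sys


def get_csv_path(argv):
    """Return the CSV path from argv, or exit with status 2 if it's missing.

    Hypothetical helper: the name and the exit status are illustrative.
    """
    if len(argv) != 2:
        print(f"usage: {argv[0]} PATH_TO_CSV", file=sys.stderr)
        raise SystemExit(2)
    return argv[1]


# In count_filter.py this would be get_csv_path(sys.argv); a fixed list is
# used here so the sketch runs on its own.
path_to_file = get_csv_path(["count_filter.py", "./db.csv"])
print(path_to_file)  # ./db.csv
```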
5. Using a Here-Document
Another option is to embed the Python code explicitly within the Bash script using a here-document.
In this case, we have two options:
- save the Python code to an intermediate script file and then call the Python interpreter
- directly execute the code, skipping the intermediate file
Let’s explore both approaches.
5.1. Saving to a Script File
Using a here-document, we can simply direct the content to a script file, named count_filter.py:
$ cat sample_task.sh
#!/usr/bin/env bash
cat << EOF > ./count_filter.py
import pandas as pd
import numpy as np
import sys
path_to_file = sys.argv[1]
df = pd.read_csv(path_to_file)
values, counts = np.unique(df['OCCUPATION'], return_counts=True)
x = [values[i] for i in range(len(values)) if counts[i] not in [min(counts), max(counts)]]
df['Selected'] = df['OCCUPATION'].isin(x)
df.to_csv('./result.csv', index=False, header=False)
EOF
python3 ./count_filter.py ./db.csv
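We’ve used the EOF marker within the sample_task.sh Bash script to mark the start and end of the here-document. Because the marker is unquoted, Bash performs parameter expansion and command substitution on the body before writing it out. This Python code contains no $ characters or backticks, so it's written unchanged. For code that does contain them, quoting the marker, as in << 'EOF', disables the expansion. The Python code is the same as that of the standalone script discussed earlier.
Then, in the last line, we call the Python interpreter on the count_filter.py script while passing the path to the CSV file as an argument.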
Let’s grant execute permissions to the Bash script via chmod:
$ chmod +x ./sample_task.sh
Finally, let’s run the script:
$ ./sample_task.sh
The result is saved in the result.csv file, overwriting the previous version:
$ cat ./result.csv
1,Ron,Engineer,False
2,Lin,Engineer,False
3,Tom,Architect,True
4,Mat,Engineer,False
5,Ray,Botanist,False
6,Val,Architect,True
We see the same outcome as before.
5.2. Running Python Code Directly
Alternatively, we can skip saving the contents of the here-document to an intermediate file:
$ cat sample_task.sh
#!/usr/bin/env bash
python3 - ./db.csv << EOF
import pandas as pd
import numpy as np
import sys
path_to_file = sys.argv[1]
df = pd.read_csv(path_to_file)
values, counts = np.unique(df['OCCUPATION'], return_counts=True)
x = [values[i] for i in range(len(values)) if counts[i] not in [min(counts), max(counts)]]
df['Selected'] = df['OCCUPATION'].isin(x)
df.to_csv('./result.csv', index=False, header=False)
EOF
The difference here is that python3 runs the contents of the here-document directly. The hyphen tells python3 to read the program from stdin, which the here-document supplies, while the path to the CSV data is passed as an argument after the hyphen.
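The argument handling in this mode can be checked with a small sketch that feeds a program to a child interpreter on stdin, much as the here-document does. It uses subprocess only to stay self-contained; sys.executable stands in for python3, and the variable names are our own:

```python
import subprocess
import sys

# The program text that the here-document would supply on stdin.
code = "import sys\nprint(sys.argv)\n"

# Equivalent of: python3 - ./db.csv << EOF ... EOF
result = subprocess.run(
    [sys.executable, "-", "./db.csv"],
    input=code,
    capture_output=True,
    text=True,
    check=True,
)

stdin_argv = result.stdout.strip()
print(stdin_argv)  # ['-', './db.csv']
```

Here, sys.argv[0] is the hyphen itself, so the CSV path is still available as sys.argv[1], which is why the embedded code needs no changes.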
Next, we run the Bash script and view the result.csv file that was generated:
$ ./sample_task.sh
$ cat ./result.csv
1,Ron,Engineer,False
2,Lin,Engineer,False
3,Tom,Architect,True
4,Mat,Engineer,False
5,Ray,Botanist,False
6,Val,Architect,True
As expected, we obtain the same result as before.
6. Conclusion
In this article, we explored different methods for calling Python from within a Bash script. In particular, we saw how to use the python3 -c command, as well as how to call a standalone Python script, and how to use a here-document for embedding Python code within a Bash script.