1. Overview
In this tutorial, we’ll talk about different ways we can use curl to download several web addresses or URLs into separate files in parallel. We might need this to speed up our downloads or the scripts that rely on them.
There are other alternatives developed by the community such as wget2 and pget. However, we’ll focus on tools that we should have readily available on our system.
For simplicity, we’ll simply refer to the web addresses as url1, url2, url3, etc., even if they usually look like http://www.domain.extension/subpage/…/subpage/file. We’ll name the files where we want to save these webpages as out1, out2, out3, and so on.
2. Using curl Itself
The curl utility (together with wget) is one of the commands we can use to download and upload data from URLs non-interactively. The command has many uses and, depending on its version, we can directly use it to download multiple URLs at the same time.
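If we’re not sure which version of curl we have installed, we can check it first, as the first line of the output contains the version number:
$ curl --version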
If we’ve installed version 7.66.0 (or newer) of curl, we can directly use it to get parallel downloads with the -Z flag (or alternatively the --parallel flag):
$ curl -Z url1 -o out1 url2 -o out2 url3 -o out3 <...>
The -o flag specifies the name of the output file of the preceding URL. These downloads are done in a single process, which should be the faster and more efficient approach, especially when compared with the methods that we’ll see in later sections.
curl limits the number of parallel transfers to 50 by default. If we specify more than 50 URLs, curl starts the next download as soon as one of the ongoing ones completes. We can also change this maximum value with --parallel-max [num].
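For instance, to cap the number of simultaneous transfers at 10, we might run something like this:
$ curl --parallel --parallel-max 10 url1 -o out1 url2 -o out2 <...>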
Specifying the URLs and the output file names one by one on the command line can be error-prone. Alternatively, we can specify the URL-output file pairs in a text file and use it as input. This text file should contain each URL on one line, followed by its output name on the next:
$ cat urls_outputs.txt
url = "url1"
output = "out1"
url = "url2"
output = "out2"
<...>
We can then use this file as input for curl with the --config flag:
$ curl --parallel --config urls_outputs.txt
Another option of curl that might be relevant for our case is --parallel-immediate. When we’re doing any parallel transfer, we can use the --parallel-immediate option to open as many connections in parallel as possible, instead of waiting to add new transfers as multiplexed streams over existing connections. The option is global, so once enabled, we’ll have to disable it with --no-parallel-immediate.
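For example, combining this option with the configuration file from the previous example might look like this:
$ curl --parallel --parallel-immediate --config urls_outputs.txt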
3. Multiple curl Instances With a for Loop
Since curl is a command we run in the terminal, we can have as many curl processes running at the same time as we want, each saving its output to a different file. Thus, we can use a for loop to spawn as many curl processes as needed. For that, we need two arrays, one for the URLs and the other for the output names:
$ urls=('url1' 'url2' 'url3' <...>)
$ outs=('out1' 'out2' 'out3' <...>)
Once these arrays are defined, we can loop over them:
$ for i in "${!urls[@]}"
do
    curl "${urls[$i]}" -o "${outs[$i]}" &
done
wait
This is useful if our curl version is old since we don’t need the -Z flag. The trick is to use the & at the end of the command to launch each curl process in the background and move on to the next iteration. We use "${!urls[@]}" to expand to the indices of the array, and wait to pause until all the background processes are complete.
The output of this method can get very messy because the progress output of several curl processes is printed to the console at the same time. If we need to check the output, we can redirect each process’s output to a file. Otherwise, we can use the -s option to silence the output of curl and show nothing on the screen.
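For instance, a minimal sketch of the loop that redirects the output of each curl process to its own hypothetical log file (curl_0.log, curl_1.log, and so on) could look like this:
$ for i in "${!urls[@]}"
do
    curl "${urls[$i]}" -o "${outs[$i]}" > "curl_$i.log" 2>&1 &
done
wait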
4. Using xargs and parallel to Launch Parallel Processes
There is an alternative to the for loop for old versions of curl. We can use xargs to run multiple curl processes at the same time. For this, we need one file that gathers all the URLs:
$ cat urls.txt
url1
url2
url3
<...>
Then we can pass this file with xargs to curl, specifying the maximum number of parallel processes with -P [num] and -n 1 to have a single argument per command line:
$ xargs -P 2 -n 1 curl -O < urls.txt
The main drawback is that we cannot specify the output filenames (and that’s why we need the -O flag, which names each output after its remote file). However, this is a one-liner that might be useful in some scenarios. We can achieve similar results with parallel, as sketched below.
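For instance, assuming GNU parallel is installed (it usually isn’t by default), a rough equivalent of the xargs one-liner could be the following, where -j plays the same role as the -P flag of xargs:
$ parallel -j 2 curl -O {} < urls.txt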
With this method, we can speed up the downloads considerably. However, we should be careful not to overload our system’s memory with too many processes. Thus, we should keep the -P argument at a reasonable value.
5. Conclusion
In this article, we’ve talked about different ways to download multiple URLs with curl. If the curl version that we’ve installed is 7.66.0 or newer, we can directly use the -Z option. Otherwise, we can work around an older curl version by using a for loop or xargs.