4 Scripts and putting it all together
Learning Objectives:

- Edit files
- Use environment variables
- Write loops to efficiently do repetitive tasks
- Edit files on Xanadu in nano
- Connect Visual Studio Code to Xanadu to edit files
- Connect your desktop to Xanadu’s file system to access files
- Write and execute scripts
We’ve now gotten some experience navigating the Linux file system, and inspecting and summarizing files. We’ve seen how we can use pipes to connect simple programs together to quickly get insight into our data. These are key ingredients to building up to analysis of data that you can be confident in. We’re now going to introduce some concepts that will help us write scripts, which are essentially shell commands organized into a file and executed as a batch, rather than executed interactively.
Scripting is not categorically different from programming. The terms are generally used for opposite ends of a spectrum of complexity. If you, for example, write bash code to download some data from a database, decompress it, and set permissions, most people would refer to that as scripting. In contrast, if you write code in C++ to parse thousands of sequence alignments and use complex numerical algorithms to estimate some statistical quantity from them, most people would refer to that as programming. In both cases, you are writing instructions for a computer to follow.
In this course we’re going to focus mostly on the simpler end of the spectrum: using shell scripts to automate chunks of data analysis performed by other programs written in languages such as C++, Python, or R.
4.1 One-Liners
In the previous chapter we introduced the pipe as a way to link the inputs and outputs of multiple shell commands. Short piped commands are sometimes referred to as “one-liners” and you can find lots of useful repositories of them at pages like this and this. A quick skim through these, especially after we’ve moved a little further through the course, will be super helpful.
4.2 Environment variables
A key feature of BASH is the use of environment variables. These are objects used to store text strings. These strings are typically used in two ways: to generalize code in shell scripts, or to configure the system.
4.2.1 Invoking variables
Some variables are set automatically (usually those used to configure the system or shell options). To see a list, try the command env. You should see a long list, but among the entries is a pretty straightforward variable: USER. To invoke it, try any one of the following:
echo $USER
echo ${USER}
echo "${USER}"
The key thing here is the $. The curly brackets and quotes can slightly modify how the variable itself is invoked or understood by the shell (more in a moment), but the $ is how you let the shell know you’re calling out a variable.
Variables can be interpolated into text strings or commands in a variety of ways. Try this, for example:
echo "My username is $USER"
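To see why the curly brackets matter, here is a small sketch (using a throwaway variable, not one of the system’s) showing how ${} marks where a variable name ends:

```shell
ANIMAL="cat"
echo "$ANIMALs"      # bash looks for a variable named ANIMALs, which is unset: prints a blank line
echo "${ANIMAL}s"    # the braces mark where the name ends: prints "cats"
```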
4.2.2 Setting variables
Users can also create variables, or modify existing ones. To create a variable holding a path, for example:
MYHOMEDIR="/home/FCAM/$USER/"
echo $MYHOMEDIR
In this case you can see we’ve defined a variable using another variable. We can equally well define output files this way:
MYOUTFILE="/home/FCAM/$USER/${USER}_testfile.txt"
echo $MYOUTFILE
echo "hello world!" >$MYOUTFILE
cat $MYOUTFILE
4.2.3 An aside about quotes in BASH
Now that we’re getting a little more into bash, we need to be careful when thinking about quotes and whitespace.
Try to assign the text string “Hello World” without quotes like this:
HW=Hello World
echo $HW
You should get the error -bash: World: command not found. This is because bash assigned the string “Hello” to HW only for the command it expected to follow, then treated “World” as that command, which happens not to exist. The entire command failed, and HW was never set in your current shell.
If we wrap the text in double quotes it will work:
HW="Hello World"
echo $HW
If we wrap the text in single quotes it will work:
HW='Hello World'
echo $HW
But what’s the difference between double and single quotes? Single quotes prevent variable interpolation, leading to very different results:
HWD="Hello $USER"
echo $HWD
HWS='Hello $USER'
echo $HWS
Quoting generally ensures whitespace is maintained as part of a string, which is sometimes very important. Single-quoting prevents variable interpolation (i.e. it treats strings literally) while double-quoting allows it. Tripping over quoting is a common source of errors, even for intermediate-level bash users.
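A quick illustration of the whitespace point:

```shell
SPACED="one    two"   # four spaces stored inside the quotes
echo $SPACED          # unquoted: bash word-splits, and the spacing collapses to "one two"
echo "$SPACED"        # quoted: the original spacing survives
```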
4.2.4 The PATH variable
Another key variable you can see in the list produced by env is PATH. It’s a critical variable for configuration. It tells the shell where to look for programs you try to execute. Try
echo $PATH
You’ll see a colon-separated list of paths. Every program you successfully execute will be found in one of those paths. You can add to this list if you want bash to look somewhere new, like this:
PATH=$PATH:/path/to/my/new/favorite/software
That is also generally how you can redefine a variable in terms of its current value, should you need to.
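The same pattern works for any variable, not just PATH. A minimal sketch:

```shell
GREETING="Hello"
GREETING="$GREETING, world"   # the old value is interpolated into the new definition
echo "$GREETING"              # prints "Hello, world"
```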
When executing programs it can be really useful to know which copy of a program is being executed, as sometimes there will be multiple versions installed on a given system (this is definitely true of Xanadu). You can use the command which for this:
which ls
which grep
which cat
You’ll note that each of these is located in one of the directories listed in your PATH variable. This will be less trivial, and more important to be able to check, when it comes to software used for data analysis on Xanadu.
4.2.5 Generalizing code
So far we have been executing commands by explicitly referring to files, programs and their parameters. For example:
grep -o -P "(?<=[Ss]pecies )[A-Za-z]+" darwin1859.txt | sort | uniq -c | sort -g | tail
In this case, we have specified programs by name, several options, a regular expression, and an input file. We could easily make this line of code more general by replacing important parts of it with variables:
REGEX="(?<=[Ss]pecies )[A-Za-z]+"
INFILE=darwin1859.txt
grep -o -P "${REGEX}" $INFILE | sort | uniq -c | sort -g | tail
Note that quoting the regular expression variable, "${REGEX}", is necessary in this case. If you try it without the quotes, some of those special characters will trip bash up and, by extension, cause grep to fail.
Using variables like this can be especially useful in two cases:
- When paths get extremely long and cumbersome and/or file names are not descriptive. Long paths can make code difficult to read. Replacing them with descriptive variables allows you (and anyone else) to quickly see HOW you ran a program, instead of having to scan over multiple instances of /the/very/long/and/convoluted/path/to/my/precious/input/data.txt.
- When code must be repeated many times on many different input files. Copying, pasting and editing the same line over and over again is extremely inadvisable, as even the most conscientious people are prone to mistakes. If we get used to using variables like this, we gain the flexibility to use loops and other means of parallelization that make our code less error prone and faster, and that prefigure the way workflow languages we will explore next semester require us to think about and write code.
4.3 Loops
It’s often the case that we want to do something repeatedly. It may be that we want to do some analysis the same way for multiple input files, or do something with every element of a list, or line of a file. Sometimes, though less often in data analysis, we may want to repeat a process until some condition is met.
One common way to approach this is with a loop. A loop generally has the following very general syntax:

loop condition
{
    command1
    command2
    command3
}
In BASH we typically use for and while loops. A for loop iterates over a list of fixed length, executing the same code each time. A while loop executes code repeatedly until some condition is no longer met.
4.3.1 for loops
With for loops we can iterate over any sort of list. Let’s extract mentions of classic Darwin terms “species”, “forms”, and “varieties” from the Origin:
for WORD in species forms varieties
do
grep "$WORD" darwin1859.txt >"${WORD}"_mentions.txt
done
In this syntax we tell bash we’re going to use the variable WORD and, for each iteration of the loop, populate it with an element of the list “species forms varieties”. The elements are separated by whitespace. The code to be executed on each iteration is sandwiched between the keywords do and done. We redirect the output to three separate files named “species_mentions.txt”, etc.
We can specify the list of elements to be iterated over in any number of ways. We can use a range of numbers or letters with a shell expansion:
for NUM in {1..10}
do
echo $NUM
done
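Brace expansions like {1..10} are more flexible than they first appear. A few variants you may find useful (the step and zero-padding forms assume a reasonably recent bash, version 4 or later):

```shell
echo {1..5}        # 1 2 3 4 5
echo {a..e}        # a b c d e
echo {0..10..2}    # with a step of 2: 0 2 4 6 8 10
echo {01..05}      # zero-padded: 01 02 03 04 05
```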
We can use a glob to iterate over files:
for FILE in *mentions.txt
do
head "$FILE"
done
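When looping over files like this, it is often handy to strip part of the file name, for example to build an output name from an input name. A sketch using bash’s suffix-removal expansion (the suffix pattern here matches the files created above):

```shell
for FILE in *mentions.txt
do
    WORD="${FILE%_mentions.txt}"   # ${VAR%PATTERN} removes a matching suffix
    echo "Word extracted from ${FILE}: $WORD"
done
```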
4.3.1.1 A digression about BASH arrays
Sometimes we may wish to construct a list outside of the context of the loop itself. It can be useful to store that list in a special type of variable called a bash array (or just “array”, but we will deal with other types of arrays later, so we’ll try to be specific). bash arrays are lists that require you to use a special syntax to create and access their elements.
bash arrays can be created simply with parentheses like this:
DARWIN=(species forms varieties)
Or using ls or find, or a glob (*) to grab files:
DARWIN=(*mentions.txt)
DARWIN=($(ls *mentions.txt))
In this last case, the syntax $(command) is a command substitution. It means: run the command and insert its output in place (here, inside the () used to define the array). Also note that the array elements will be parsed based on whitespace, so if file names contain whitespace (you should always avoid this) this approach will break.
bash arrays are zero-indexed, meaning the first element is element 0. We can access elements with this syntax:
echo ${DARWIN[0]}
To write out ALL the elements:
echo ${DARWIN[@]}
To get the length of the array:
echo ${#DARWIN[@]}
To use this in a loop we can either provide the whole array like this:
DARWIN=($(ls *mentions.txt))
for FILE in ${DARWIN[@]}
do
echo "FIRST TEN LINES OF $FILE -------------------"
head $FILE
done
Or we can iterate over the indexes like this:
DARWIN=($(ls *mentions.txt))
for NUM in {0..2}
do
echo "FIRST TEN LINES OF ${DARWIN[$NUM]} -------------------"
head ${DARWIN[$NUM]}
done
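Hard-coding the index range {0..2} works here, but it will silently break if the number of files changes. bash can generate the index list for you with the ${!ARRAY[@]} expansion:

```shell
DARWIN=($(ls *mentions.txt))
for NUM in "${!DARWIN[@]}"     # expands to the list of indexes, whatever the array length
do
    echo "FIRST TEN LINES OF ${DARWIN[$NUM]} -------------------"
    head "${DARWIN[$NUM]}"
done
```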
4.3.2 while loops

while loops differ from for loops in that they evaluate repeatedly until some condition is no longer satisfied. You can use:
COUNTER=1
while [ $COUNTER -le 5 ]
do
echo "Count: $COUNTER"
((COUNTER++))
done
Where [ $COUNTER -le 5 ] is a conditional construct indicating that COUNTER must be less than or equal to 5. If this evaluates to false, the loop ends. Conditional constructs are a general way to manage when code is executed, and we will cover them more later. ((COUNTER++)) adds one to COUNTER every time the loop executes.
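The ((...)) syntax is bash’s arithmetic context, and the related $((...)) is arithmetic expansion; both are worth knowing on their own. A quick sketch:

```shell
COUNTER=5
echo $((COUNTER + 1))        # arithmetic expansion: prints 6 (COUNTER itself is unchanged)
((COUNTER = COUNTER * 2))    # arithmetic evaluation: COUNTER is now 10
echo $COUNTER
```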
You can also use while to iterate over the lines of a file:
FILE=darwin1859.txt
COUNTER=1
while IFS= read -r line ; do
echo "Processing line $COUNTER: $line"
((COUNTER++))
done < $FILE
In this case we’re using < $FILE to redirect standard input so that it comes from FILE, and read will consume it one line at a time. When FILE runs out of lines, the loop will end.
4.4 Editing Files
We’ve covered lots of basic bash features, and we’re almost ready to start writing scripts. Now that you’ve been writing lots of commands in the terminal, and you recognize you’re connected to a remote computer cluster that your local file system doesn’t have access to, you’re probably starting to wonder: how do I actually write and save lots of code on the remote server?
In this section we’ll deal with that issue and cover some tools you can use to write and edit code, both locally and remotely.
4.4.1 Command-line text editors
The most straightforward approach to this is to use CLI text editors that are already available on Xanadu (and most Linux systems). Since you run them on Xanadu, they can directly create and access files there. The most common ones are nano, vim and emacs. vim and emacs are very powerful editors: they are highly customizable and have large user communities. They have steep learning curves, however, and there are alternatives to CLI editors, so in this course we are going to focus on nano, which is simple to use and suitable for quick edits and copy-paste operations.
If you simply type nano on the command line, it will open a new text document. You can immediately begin typing whatever you like. When you’re done, type ctrl-x and it will ask you if you want to save the file. If you do, type y, then, when prompted, write a file name and hit enter. That’s it. If you want to edit an existing file, type nano filename. Save and exit the same way. ctrl-c cancels.
You can navigate around with the arrows on your keyboard (and some keyboard shortcuts we’ll talk about in a video).
4.4.2 Code editors
Another approach, suitable for users just starting out, is to write code using a locally installed, dedicated code editor such as Sublime Text. There are others, but for this course I recommend you download and use Sublime for at least some cases. Unlike nano, it has lots of features and user-created plugins. It has syntax highlighting, which means that if you select the language you are writing in from a dropdown menu in the bottom right corner, it will recognize syntactic features of the language and change the color of the text in ways that greatly help with editing.
A straightforward way to use Sublime to write scripts is to write code locally, and then paste it into documents using nano, or use scp/rsync to upload the documents to Xanadu. These solutions are not exactly ideal, but they work in a pinch. It is possible to connect Sublime through an ssh tunnel to give it access to documents on Xanadu, but we haven’t yet shown you the tools to do that, and we’ll introduce another, easier approach in a moment.
4.4.3 Integrated development environments.
IDEs are dedicated code editors with lots more features. Visual Studio Code is another program we recommend you download and install locally. It also has syntax highlighting, but you’ll notice you can open up projects and see the entire directory structure, and if you want you can even open a terminal inside it. Xanadu’s operating system is a bit old (it will soon be replaced by a new cluster, Mantis), so you will need to install an older version of VS Code, 1.85, which is available here.
4.4.4 Conveniently accessing files.
Ok, so aside from using nano (or learning to be a vim power user), how can you conveniently access and edit files on Xanadu? There are two relatively straightforward ways:
- Visual Studio Code has an extension Remote - SSH. This is highly recommended. You can connect VS Code to Xanadu directly via SSH. You can open windows focusing on specific directories, visualize the directory structure, and create and edit files directly on Xanadu. You can also open a Xanadu terminal window so that you can test and execute code, all in VS Code.
Please don’t use language-specific code extensions (Python, R, etc.) to run code on Xanadu from within VS Code. We’ll cover this in the next chapter, but VS Code connects to a login node, which has few resources and is shared by many users. To run code on Xanadu, it must be submitted through the batch scheduler, SLURM. A VS Code extension that runs code for you can only run it on the login node, which will cause problems for everyone.
- You can mount the Xanadu file system on your local computer. You can get your operating system to “see” Xanadu’s filesystem, and access files as if they were local. You can then edit them using Sublime, VS Code, or anything else. To do this, you must be connected to the CAM VPN (instructions).
- From a Mac, you can select from your top dropdown menu Go: Connect to Server and enter smb://cfs09.cam.uchc.edu/home/FCAM/<username>. When prompted, enter your CAM credentials (not your netID/password).
- From Windows, you can map a network filesystem using these directions and the address formatted like this:
\\cfs09.cam.uchc.edu\home\FCAM\<username>
You should only use dedicated code-editing software (or CLI editors on Xanadu) for editing code. Editing with word processors (such as MS Word) will often lead to difficult-to-diagnose problems: word processors often insert hidden characters and/or use line break characters incompatible with Linux, which cause confusing errors when executing code. If you think this might be the issue, the first thing to try is cat -A myscript.sh. If myscript.sh contains any weird hidden characters, this will print them. Compare to a script you know works. Non-standard line breaks are often the culprit, and this will reveal them.
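To see what this looks like, here is a sketch using printf to fake a Windows-style line ending (cat -A is the GNU coreutils version of the flag; on other systems it may differ):

```shell
printf 'hello\r\n' | cat -A   # Windows-style CRLF ending shows up as: hello^M$
printf 'hello\n' | cat -A     # a normal Linux ending shows up as:     hello$
```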
4.5 Scripts
Ok, now that we’ve got all the pieces, we can start writing longer bits of code into scripts so that we can run them as batches instead of interactively, line by line. Scripts can be as simple as a series of commands that you could execute interactively, or as complicated as computer programs that run entire analyses for you.
In this course, we’ll aim for the simpler end of the spectrum. We want to write code in chunks that we can pass off to Xanadu’s job scheduler and that will serve as the most fundamental documentation of the analyses we do. While keeping notes on what you do (or try) is important, the code itself is the ultimate documentation of your analysis, so it should be complete and well organized into scripts.
4.5.1 The shebang
A script can be written in any language, so typically the way we start one off in a Linux environment is to tell the system what interpreter to use. Right now we’re writing bash code (rather than R, Python, Perl or something else), so we need to say so. To do that, the first line of the script will always be what’s referred to as a shebang: #! followed by the interpreter we wish to use, in this case /bin/bash, so:
#!/bin/bash
If we were writing Python code we might instead write:

#!/usr/bin/env python

This would tell the system to use whichever python interpreter is found first by searching our PATH variable.
After the shebang line, we can start writing code.
4.5.3 Executing scripts
There are a few ways to execute scripts.
- source commonWords.sh

This method ignores the shebang (it begins with #, after all) and executes all the code in the current shell session, as if you had typed it on the command line. If you created any variables in the current environment, they will be available in the script.

- bash commonWords.sh

This method also ignores the shebang and explicitly invokes bash, but it creates a subshell. The execution context is mostly isolated from the environment from which you submitted the script, and environment variables you create are not available unless you export them.

- ./commonWords.sh

This will execute the script, looking at the shebang to see which interpreter should be used. It also creates an isolated execution context.

- Passing the script to a job scheduler (in our case SLURM). This will be covered in the next chapter.
What does it mean to have an isolated environment? Modify the commonWords.sh
script so that the variable definition is commented out and thus ignored by the interpreter:
#!/bin/bash
# this script will print the 10 most common words found in a text file and their frequencies
# this line specifies the text file
# ORIGIN=darwin1859.txt
# this line extracts and prints the word list
grep -o -P "\b[A-Za-z]{4,}\b" $ORIGIN | sort | uniq -c | sort -g | tail -n 10
If you try to execute it, you will find that it hangs indefinitely: because the ORIGIN variable is empty, grep receives no file name and waits for input on its standard input.
If you define the ORIGIN variable and then run the script, it will work with the source method:
ORIGIN=darwin1859.txt
source commonWords.sh
But for it to work with the other two methods, you need to export the variable:
export ORIGIN=darwin1859.txt
./commonWords.sh
Exporting the variable makes it so that it will be inherited by any processes spawned by the current shell.
Another big difference between source and the other two methods is that any variables created by the script will be available in the current shell session after the script has completed.
4.5.3.1 Command-line arguments
Note that we can generalize our script by writing it so that it expects command-line input from the user:
#!/bin/bash
# this script will print the 10 most common words found in a text file and their frequencies
# this line specifies the text file
ORIGIN="$1"
REGEX="$2"
# this line extracts and prints the word list
grep -o -P "$REGEX" $ORIGIN | sort | uniq -c | sort -g | tail -n 10
Save this script to the file test.sh and make it executable. We can now run it like this:
./test.sh darwin1859.txt "\b[A-Za-z]{4,}\b"
The variables $1 and $2 are automatically populated with the whitespace-separated arguments provided on the command line when executing the script. There are more complex ways to provide and parse command-line arguments, but we won’t cover them here.
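You can experiment with positional parameters without writing a script: the set -- builtin populates $1, $2, etc. in the current shell. The related variables $# (the argument count) and $@ (all arguments) are also worth knowing:

```shell
set -- darwin1859.txt "\b[A-Za-z]{4,}\b"
echo "Number of arguments: $#"   # prints 2
echo "First argument: $1"        # prints darwin1859.txt
echo "All arguments: $@"
```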
4.5.4 Errors and exit codes
Code can fail for many reasons. You will inevitably make mistakes in command line usage for some program, refer to paths that don’t exist, or use incorrect syntax, and an error will result. Computer clusters are complex, and they also inevitably have problems. Sometimes a compute node will crash. Sometimes a user will accidentally get around whatever guardrails the system administrators have in place and muck up some important settings. Troubleshooting these errors can be a major challenge for beginners.
Here are a few common errors you are likely to encounter:
“No such file or directory”
cat /path/does/not/exist.txt
Leads to cat: /path/does/not/exist.txt: No such file or directory. Check your path/filename.
“Permission denied”
echo "I'm going to put a file in the root directory of the cluster!" >/file.txt
Leads to bash: /file.txt: Permission denied. Check that your path is correct and that you actually should have permission. If you own or share the file or the directory, fix the permissions. If someone else owns it, ask them to fix the permissions.
Bad quoting

grep endless forms darwin1859.txt

Leads to search results for “endless” but also grep: forms: No such file or directory. This is a quoting issue: forms is treated as another file to be searched because of the whitespace. It should be grep "endless forms" darwin1859.txt
“Command not found”
gep "endless forms" darwin1859.txt
Leads to bash: gep: command not found. A classic typo. “Command not found” can also be a sign of incorrect quoting, or the dreaded incorrect line breaks that arise from editing code in non-code-specific editors.
When a command is evaluated, bash produces an exit code and stores it in the special variable $?. If the command executes without error, the exit code will be 0. If there is an error, it will be greater than 0. With scripts set up as we have above, the exit code will refer to the last command in the script. So if there are two commands in a script, and the second one completes successfully but the first fails, the exit code will still be 0. By default the script won’t just quit when it hits an error. There are settings that can change the behavior of bash when it encounters errors, but there is some controversy about whether you should use them or not.
“There is some controversy” is often the case with bash scripting. BASH is a wonky language, and it takes years to truly become an expert in it. When learning to do data analysis, we need to do our best, be vigilant about errors, and be open to learning new ways of doing things when we discover problems with our established habits. Pages like this are great for intermediate bash users to consider, but we can’t wait to analyze our data until we’ve assimilated all these best practices.
4.5.2 Comment lines

Despite our best efforts, code can sometimes be complicated and confusing to read. Because of this, it’s really important to document your code with comments. In shell scripting, lines prefixed with # will be ignored by the interpreter. You can and should write notes on these lines explaining what the code does.

Create this script with the title commonWords.sh on Xanadu and change the permissions so that you can execute it.

4.6 Exercises

See Blackboard Ultra for this section’s exercises.