Pipes and Filters

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • How can I combine existing commands to do new things?

Objectives

Using Wildcards

When run in the molecules directory, which ls command(s) will produce this output?

ethane.pdb methane.pdb

  1. ls *t*ane.pdb
  2. ls *t?ne.*
  3. ls *t??ne.pdb
  4. ls ethane.*

Solution

The solution is 3.

1. shows all files that contain any number and combination of characters, followed by the letter t, another single character, and end with ane.pdb. This includes octane.pdb and pentane.pdb.

2. shows all files containing any number and combination of characters, t, another single character, ne. followed by any number and combination of characters. This will give us octane.pdb and pentane.pdb but doesn’t match anything which ends in thane.pdb.

3. fixes the problems of option 2 by matching two characters between t and ne. This is the solution.

4. only shows files starting with ethane..

What Does sort -n Do?

If we run sort on this file:

10
2
19
22
6

the output is:

10
19
2
22
6

If we run sort -n on the same input, we get this instead:

2
6
10
19
22

Explain why -n has this effect.

Solution

The -n flag specifies a numeric sort, rather than alphabetical.

Piping Commands Together

In our current directory, we want to find the 3 files which have the least number of lines. Which command listed below would work?

  1. wc -l * > sort -n > head -n 3
  2. wc -l * | sort -n | head -n 1-3
  3. wc -l * | head -n 3 | sort -n
  4. wc -l * | sort -n | head -n 3

Solution

Option 4 is the solution. The pipe character | is used to feed the standard output from one process to the standard input of another. > is used to redirect standard output to a file. Try it in the data-shell/molecules directory!

Why Does uniq Only Remove Adjacent Duplicates?

The command uniq removes adjacent duplicated lines from its input. For example, the file data-shell/data/salmon.txt contains:

coho
coho
steelhead
coho
steelhead
steelhead

Running the command uniq salmon.txt from the data-shell/data directory produces:

coho
steelhead
coho
steelhead

Why do you think uniq only removes adjacent duplicated lines? (Hint: think about very large data sets.) What other command could you combine with it in a pipe to remove all duplicated lines?

Solution

$ sort salmon.txt | uniq

Removing Unneeded Files

Suppose you want to delete your processed data files, and only keep your raw files and processing script to save storage. The raw files end in .dat and the processed files end in .txt. Which of the following would remove all the processed data files, and only the processed data files?

  1. rm ?.txt
  2. rm *.txt
  3. rm * .txt
  4. rm *.*

Solution

  1. This would remove .txt files with one-character names
  2. This is correct answer
  3. The shell would expand * to match everything in the current directory, so the command would try to remove all matched files and an additional file called .txt
  4. The shell would expand *.* to match all files with any extension, so this command would delete all files

Wildcard Expressions

Wildcard expressions can be very complex, but you can sometimes write them in ways that only use simple syntax, at the expense of being a bit more verbose.
Consider the directory data-shell/north-pacific-gyre/2012-07-03 : the wildcard expression *[AB].txt matches all files ending in A.txt or B.txt. Imagine you forgot about this.

  1. Can you match the same set of files with basic wildcard expressions that do not use the [] syntax? Hint: You may need more than one expression.

  2. The expression that you found and the expression from the lesson match the same set of files in this example. What is the small difference between the outputs?

  3. Under what circumstances would your new expression produce an error message where the original one would not?

Solution

1.

```
$ ls *A.txt
$ ls *B.txt
```
{: .bash} 2. The output from the new commands is separated because there are two commands. 3. When there are no files ending in `A.txt`, or there are no files ending in `B.txt`.

Which Pipe?

The file data-shell/data/animals.txt contains 586 lines of data formatted as follows:

2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
...

Assuming your current directory is data-shell/data/, what command would you use to produce a table that shows the total count of each type of animal in the file?

  1. grep {deer, rabbit, raccoon, deer, fox, bear} animals.txt | wc -l
  2. sort animals.txt | uniq -c
  3. sort -t, -k2,2 animals.txt | uniq -c
  4. cut -d, -f 2 animals.txt | uniq -c
  5. cut -d, -f 2 animals.txt | sort | uniq -c
  6. cut -d, -f 2 animals.txt | sort | uniq -c | wc -l

Solution

Option 5. is the correct answer. If you have difficulty understanding why, try running the commands, or sub-sections of the pipelines (make sure you are in the data-shell/data directory).

Key Points