Pipes and Filters
Overview
Teaching: 15 min
Exercises: 0 min
Questions
How can I combine existing commands to do new things?
Objectives
Using Wildcards
When run in the `molecules` directory, which `ls` command(s) will produce this output?
ethane.pdb methane.pdb
1. `ls *t*ane.pdb`
2. `ls *t?ne.*`
3. `ls *t??ne.pdb`
4. `ls ethane.*`
Solution
The solution is `3.`

`1.` shows all files whose names contain zero or more characters (`*`), followed by the letter `t`, then zero or more characters (`*`), and end with `ane.pdb`. This also matches `octane.pdb` and `pentane.pdb`.

`2.` shows all files whose names contain zero or more characters (`*`), the letter `t`, exactly one more character (`?`), then `ne.` followed by zero or more characters (`*`). This gives us `octane.pdb` and `pentane.pdb` but does not match anything that ends in `thane.pdb`.

`3.` fixes the problems of option 2 by matching exactly two characters (`??`) between `t` and `ne`. This is the solution.

`4.` only shows files starting with `ethane.`.
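The four patterns can be checked without touching real data. The sketch below assumes the `molecules` directory holds the six `.pdb` files named in the `touch` line (only four of those names appear in this exercise; `cubane.pdb` and `propane.pdb` are assumptions) and recreates them as empty files in a temporary directory. `echo` prints whatever the shell expands each pattern to.

```shell
# Build a throwaway copy of the directory listing (empty files are enough).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

# The shell expands each pattern before echo runs, so echo shows the matches.
opt1=$(echo *t*ane.pdb)   # t, then zero or more characters, then ane.pdb
opt2=$(echo *t?ne.*)      # t, exactly one character, then ne.
opt3=$(echo *t??ne.pdb)   # t, exactly two characters, then ne.pdb
opt4=$(echo ethane.*)     # names starting with ethane.

echo "1: $opt1"
echo "2: $opt2"
echo "3: $opt3"
echo "4: $opt4"
```

Only the third pattern yields exactly `ethane.pdb methane.pdb`.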
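What Does `sort -n` Do?
If we run `sort` on a file containing these lines:
10
2
19
22
6
the output is:
10
19
2
22
6
If we run `sort -n` on the same input, we get this instead:
2
6
10
19
22
Explain why `-n` has this effect.
Solution
The `-n` flag specifies a numeric sort rather than an alphabetical one. Without it, `sort` compares lines character by character, so `10` and `19` come before `2` because the character `1` sorts before `2`.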
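The difference is easy to reproduce. This minimal sketch feeds the same five numbers to `sort` with and without `-n`; setting `LC_ALL=C` is an addition here, used only to pin the character ordering so the result does not depend on the locale.

```shell
# Recreate the five-line input from the exercise (one number per line).
numbers=$(printf '%s\n' 10 2 19 22 6)

# Default sort compares lines character by character.
text_sorted=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# -n compares the lines by their numeric value instead.
numeric_sorted=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n)

# Unquoted expansion joins the lines with spaces for compact display.
echo "sort:    $(echo $text_sorted)"
echo "sort -n: $(echo $numeric_sorted)"
```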
Piping Commands Together
In our current directory, we want to find the 3 files that have the fewest lines. Which of the commands listed below would work?
1. `wc -l * > sort -n > head -n 3`
2. `wc -l * | sort -n | head -n 1-3`
3. `wc -l * | head -n 3 | sort -n`
4. `wc -l * | sort -n | head -n 3`
Solution
Option 4 is the solution. The pipe character `|` is used to feed the standard output from one process to the standard input of another. `>` is used to redirect standard output to a file. Try it in the `data-shell/molecules` directory!
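If the `data-shell` files are not at hand, the working pipeline can be tried on throwaway data. In this sketch the file names and line counts are invented for illustration; only the pipeline itself comes from the exercise.

```shell
# Work in a temporary directory so no real files are involved.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical files with 1, 2, 3 and 5 lines.
printf 'x\n' > one.txt
printf 'x\nx\n' > two.txt
printf 'x\nx\nx\n' > three.txt
printf 'x\nx\nx\nx\nx\n' > five.txt

# Count lines per file, sort the counts numerically, keep the first three rows.
shortest=$(wc -l * | sort -n | head -n 3)
echo "$shortest"
```

When given several files, `wc -l` also prints a `total` row. It carries the largest count, so it sorts last and only reaches the output if fewer than three files are present.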
Why Does `uniq` Only Remove Adjacent Duplicates?
The command `uniq` removes adjacent duplicated lines from its input. For example, the file `data-shell/data/salmon.txt` contains:
coho
coho
steelhead
coho
steelhead
steelhead
Running the command `uniq salmon.txt` from the `data-shell/data` directory produces:
coho
steelhead
coho
steelhead
Why do you think `uniq` only removes adjacent duplicated lines? (Hint: think about very large data sets.) What other command could you combine with it in a pipe to remove all duplicated lines?
Solution
$ sort salmon.txt | uniq
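One plausible reading of the hint is that comparing each line only with the one before it means `uniq` never has to hold more than a line or two in memory, however large the input is. Sorting first brings identical lines together so that this simple comparison catches every duplicate. The sketch below recreates the six lines of `salmon.txt` in a temporary file (the temporary file is an assumption made so the example runs anywhere) and compares the two commands.

```shell
# Recreate the contents of salmon.txt as shown in the exercise.
salmon=$(mktemp)
printf '%s\n' coho coho steelhead coho steelhead steelhead > "$salmon"

# uniq alone only collapses runs of identical neighbouring lines.
adjacent_only=$(uniq "$salmon")

# Sorting first makes every duplicate adjacent, so uniq removes them all.
all_removed=$(sort "$salmon" | uniq)

echo "uniq:        $(echo $adjacent_only)"
echo "sort | uniq: $(echo $all_removed)"
rm "$salmon"
```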
Removing Unneeded Files
Suppose you want to delete your processed data files and keep only your raw files and processing script to save storage. The raw files end in `.dat` and the processed files end in `.txt`. Which of the following would remove all the processed data files, and only the processed data files?
1. `rm ?.txt`
2. `rm *.txt`
3. `rm * .txt`
4. `rm *.*`
Solution
1. This would remove `.txt` files with one-character names.
2. This is the correct answer.
3. The shell would expand `*` to match everything in the current directory, so the command would try to remove all matched files and an additional file called `.txt`.
4. The shell would expand `*.*` to match all files with any extension, so this command would delete the raw files as well as the processed ones.
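A safe way to check a wildcard before handing it to `rm` is to pass the same pattern to `echo`, which prints the expanded file list and deletes nothing. The file names below are hypothetical stand-ins for raw files, processed files and a script.

```shell
# Throwaway directory with made-up raw (.dat), processed (.txt) and script files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt results.txt notes.txt raw1.dat raw2.dat script.sh

# echo shows the arguments rm would receive for each pattern.
one_char=$(echo ?.txt)     # only .txt files with a one-character name
processed=$(echo *.txt)    # every .txt file
everything=$(echo *.*)     # every name that contains a dot

echo "?.txt -> $one_char"
echo "*.txt -> $processed"
echo "*.*   -> $everything"
```

Only `*.txt` lists all the processed files and nothing else.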
Wildcard Expressions
Wildcard expressions can be very complex, but you can sometimes write them in ways that only use simple syntax, at the expense of being a bit more verbose.
Consider the directory `data-shell/north-pacific-gyre/2012-07-03`: the wildcard expression `*[AB].txt` matches all files ending in `A.txt` or `B.txt`. Imagine you forgot about this.
1. Can you match the same set of files with basic wildcard expressions that do not use the `[]` syntax? Hint: You may need more than one expression.
2. The expression that you found and the expression from the lesson match the same set of files in this example. What is the small difference between the outputs?
3. Under what circumstances would your new expression produce an error message where the original one would not?
Solution
1.
```
$ ls *A.txt
$ ls *B.txt
```
2. The output from the new commands is separated because there are two commands.
3. When there are no files ending in `A.txt`, or there are no files ending in `B.txt`.
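The third answer can be seen in a directory that has files ending in `A.txt` but none ending in `B.txt`. This sketch uses invented file names and assumes a shell with default globbing settings (such as Bash without `nullglob` or `failglob`), where an unmatched pattern is passed to the command unchanged.

```shell
# Hypothetical data files; note that nothing ends in B.txt.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch NENE01A.txt NENE02A.txt NENE01Z.txt

# The bracket pattern still matches the A files, so ls succeeds.
bracket_out=$(ls *[AB].txt)
bracket_status=$?

# Here *B.txt matches nothing, so ls receives the literal text "*B.txt",
# cannot find a file with that name, and exits with a non-zero status.
split_out=$(ls *A.txt *B.txt 2>/dev/null)
split_status=$?

echo "bracket pattern: exit status $bracket_status"
echo "two patterns:    exit status $split_status"
```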
Which Pipe?
The file `data-shell/data/animals.txt` contains 586 lines of data formatted as follows:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
...
Assuming your current directory is `data-shell/data/`, what command would you use to produce a table that shows the total count of each type of animal in the file?
1. `grep {deer, rabbit, raccoon, deer, fox, bear} animals.txt | wc -l`
2. `sort animals.txt | uniq -c`
3. `sort -t, -k2,2 animals.txt | uniq -c`
4. `cut -d, -f 2 animals.txt | uniq -c`
5. `cut -d, -f 2 animals.txt | sort | uniq -c`
6. `cut -d, -f 2 animals.txt | sort | uniq -c | wc -l`
Solution
Option 5 is the correct answer. If you have difficulty understanding why, try running the commands, or sub-sections of the pipelines (make sure you are in the `data-shell/data` directory).
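To see why the `sort` step matters, the pipeline can be run on a handful of rows. The six rows below are a made-up sample in the same `date,animal` format, not the real 586-line file.

```shell
# Small invented sample in the same format as animals.txt.
animals=$(mktemp)
printf '%s\n' \
  '2012-11-05,deer' \
  '2012-11-05,rabbit' \
  '2012-11-05,raccoon' \
  '2012-11-06,rabbit' \
  '2012-11-06,deer' \
  '2012-11-07,rabbit' > "$animals"

# Option 4: without sort, repeated animals are not adjacent,
# so uniq -c reports each row separately with a count of 1.
unsorted=$(cut -d, -f 2 "$animals" | uniq -c)

# Option 5: keep the animal column, sort it so equal names are adjacent,
# then let uniq -c count each run.
counts=$(cut -d, -f 2 "$animals" | sort | uniq -c)

echo "$counts"
rm "$animals"
```

Appending `| wc -l` as in option 6 would reduce this table to a single number: how many distinct animals there are.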
Key Points