Introduction
to Bioinformatics online course: IBT
Linux
Extracting information from files
Introduction to Bioinformatics online course: IBT
Linux | Amel Ghouila
Learning Objectives
① Learn how to search for patterns in files and how
to extract specific data
② Learn how to sort file contents
③ Learn basic commands to compare file
contents
④ Learn how to redirect results
⑤ Learn how to combine commands
Learning Outcomes
① Be able to search for patterns in files and extract
specific data
② Be able to sort file contents
③ Be able to use some basic commands to
compare file contents
④ Know how to write command results to a
file
⑤ Be able to combine different commands
Part 1
Basic operations on files and data
extraction
Some statistics about your file
content: wc command
• wc prints newline, word, and byte counts for each
file
• Syntax: wc <options> <filename>
• Some useful options:
• -c: prints the byte count
• -m: prints the character count
• -l: prints the newline count
• For more information about any command, use
man <commandname>
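The options above can be tried on a small file. This is a minimal sketch: the file name sample.txt and its contents are made up for illustration, and -w (word count) is a standard wc option that is not listed on the slide. The file is read through standard input (<) so that wc prints only the number, without the file name.

```shell
# Create a small three-line example file (hypothetical name and content).
printf 'ATGC\nGGTA CCAT\nTTAA\n' > sample.txt

# Count lines, words and bytes separately.
# Reading from standard input (<) keeps the file name out of the output.
lines=$(wc -l < sample.txt)
words=$(wc -w < sample.txt)
bytes=$(wc -c < sample.txt)

echo "lines: $lines, words: $words, bytes: $bytes"
```

For this file, wc reports 3 lines, 4 words and 20 bytes (each newline counts as one byte).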
Basic operations on files
• sort: reorders the lines of a file “alphabetically”
syntax: sort <filename>
• uniq: removes adjacent duplicated lines
syntax: uniq <filename>
• join: merges the lines of 2 files that share a
common field
syntax: join <filename1> <filename2>
• diff: compares the contents of 2 files and outputs the
differences
syntax: diff <filename1> <filename2>
Sorting data
• sort outputs the lines of a file in sorted order,
based on a specified sort key (default: the
entire line)
• Syntax: sort <options> <filename>
• Default field separator: blank
• Sorted files are used as input for several other
commands, so sort is often used in combination
with other commands
• For <options> see man sort
Sorting data: examples
• Sort alphabetically (default option): sort <filename>
• Sort numerically: sort -n <filename>
• Sort on a specific column (no. 4): sort -k 4 <filename>
• Sort using a tab as the field separator: sort -t $'\t' <filename>
• ...
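The sort options above can be combined. The sketch below assumes a made-up tab-separated file, genes.tsv, with a gene name, a chromosome and a length per line. It stores the tab character in a variable, a portable alternative to the bash shorthand $'\t', and uses -k 3,3 to restrict the key to column 3 only, whereas -k 3 alone would use everything from column 3 to the end of the line.

```shell
# Build a small tab-separated table: gene name, chromosome, length (made-up values).
printf 'geneB\tchr2\t900\ngeneA\tchr10\t1500\ngeneC\tchr1\t80\n' > genes.tsv

# Store a literal tab character; in bash, $'\t' is an equivalent shorthand.
tab=$(printf '\t')

# Default behaviour: sort whole lines alphabetically.
sort genes.tsv > by_name.tsv

# Sort numerically (-n) on column 3 only (-k 3,3), with tab as separator (-t).
sort -t "$tab" -k 3,3 -n genes.tsv > by_length.tsv

cat by_length.tsv
```

Without -n, the lengths would be compared as text, and 1500 would sort before 80.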
Extracting data from files
• grep: searches for occurrences of a specific
pattern (a regular expression, possibly with wildcards)
in a file
Syntax: grep <pattern> <filename>
• cut: extracts specific fields from a file
Syntax: cut <options> <filename>
grep command
• grep (“global regular expression print”) is used to
search for occurrences of a specific pattern (a regular
expression) in a file
• grep outputs each whole line containing that pattern
• For <options> see man grep
Example:
Extract lines containing the pattern xxx from a file:
grep xxx <filename>
Extract lines that do not contain the pattern xxx from a file:
grep -v xxx <filename>
grep example
Let’s consider a file named “ghandi.txt”
$ cat ghandi.txt
The difference between what we do
and what we are capable of doing
would suffice to solve
most of the world's problems
$ grep what ghandi.txt
The difference between what we do
and what we are capable of doing
$ grep -v what ghandi.txt
would suffice to solve
most of the world's problems
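Two further grep options are often useful with the same file. This is a small sketch, not part of the original slide: -c and -i are standard grep options, and the_lines.txt is a hypothetical output file name.

```shell
# Recreate the example file from the slide.
printf '%s\n' \
  'The difference between what we do' \
  'and what we are capable of doing' \
  'would suffice to solve' \
  "most of the world's problems" > ghandi.txt

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -c what ghandi.txt)

# -i ignores case, so the pattern "the" also matches "The".
grep -i the ghandi.txt > the_lines.txt

echo "lines containing 'what': $count"
cat the_lines.txt
```

Here grep -c reports 2 matching lines, and grep -i selects the first and last lines of the file. Without -i, only the last line would match.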
cut command
• cut is used to extract specific fields from a file
• Structure: cut <options> <filename>
• For <options> see man cut
• Important options are
• -d (field delimiter; default: tab)
• -f (field specifier)
Example:
Extract fields 2 and 3 from a file that uses a space as the separator:
cut -d' ' -f2,3 <filename>
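The same two options work with any single-character delimiter. As an illustration only, the sketch below assumes a made-up comma-separated file, samples.csv, and keeps its first and third columns.

```shell
# A small comma-separated file: sample ID, organism, read count (made-up values).
printf 'S1,human,1200\nS2,mouse,845\nS3,human,990\n' > samples.csv

# -d sets the delimiter, -f selects the fields to keep (here fields 1 and 3).
cut -d ',' -f 1,3 samples.csv > id_counts.csv

cat id_counts.csv
```

The selected fields are written in their original order and joined with the same delimiter, so the first output line is S1,1200.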
uniq command
• uniq outputs a file with no duplicated lines
• uniq only removes adjacent duplicates, so it requires a sorted file as input
• Syntax: uniq <options> <sorted_filename>
• For <options> see man uniq
• A useful option is -c, which prefixes each line with its
number of occurrences
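The need for sorted input is easiest to see on a small case. In this sketch, organisms.txt and the other file names are hypothetical. The same list is passed to uniq once unsorted and once sorted.

```shell
# A list with repeated values that are not all adjacent (made-up data).
printf 'human\nmouse\nhuman\nhuman\nmouse\n' > organisms.txt

# On unsorted input, only adjacent duplicates are collapsed:
# the result still contains human and mouse twice each (4 lines).
uniq organisms.txt > unsorted_uniq.txt

# Sort first, then remove duplicates: 2 lines remain.
sort organisms.txt > organisms.sorted.txt
uniq organisms.sorted.txt > species.txt

# -c prefixes each distinct line with its number of occurrences.
uniq -c organisms.sorted.txt > species_counts.txt

cat species_counts.txt
```

The counts file lists human with 3 occurrences and mouse with 2. The exact width of the leading spaces before each count depends on the uniq implementation.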
join command
• join merges 2 input files based on the
entries in a common field (called the “join field”) and
outputs the merged lines
• join requires both files to be sorted on the join field
• Lines with an identical “join field” are merged into a single
output line, in which the join field appears only once
• Structure:
join <options> <filename1> <filename2>
• For <options> see man join
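As a minimal sketch of the default behaviour, assume two made-up files that share a gene ID in their first column and are already sorted on it. By default, join uses the first field of each file as the join field and blanks as separators.

```shell
# Two files sharing a gene ID in column 1, both sorted on that column
# (hypothetical names and values).
printf 'g1 kinase\ng2 transporter\ng3 ligase\n' > functions.txt
printf 'g1 chr1\ng3 chr7\ng4 chr2\n' > locations.txt

# Only IDs present in both files (g1 and g3) appear in the output.
join functions.txt locations.txt > merged.txt

cat merged.txt
```

The result has two lines, "g1 kinase chr1" and "g3 ligase chr7". The IDs g2 and g4 are dropped because each occurs in only one of the files.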
diff command
• diff compares 2 input files and displays the
lines that differ
• Can be used to highlight differences between 2
versions of the same file
• Default output: common lines are not shown; only
differing lines are listed, with an indication of what has
been added (a), deleted (d) or changed (c)
• Structure: diff <options> <filename1> <filename2>
• For <options> see man diff
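The a/d/c notation is easier to read on a concrete case. The sketch below assumes two made-up versions of a file, v1.txt and v2.txt, in which one line is changed and one line is added.

```shell
# Two versions of the same file (hypothetical content):
# line 2 differs, and v2.txt has an extra fourth line.
printf 'ATGC\nGGTA\nTTAA\n' > v1.txt
printf 'ATGC\nGGTT\nTTAA\nCCGG\n' > v2.txt

# diff exits with status 1 when the files differ; "|| true" keeps
# scripts that stop on errors (set -e) from aborting here.
diff v1.txt v2.txt > changes.txt || true

cat changes.txt
```

In the output, "2c2" marks line 2 as changed, and "3a4" means that line 4 of the second file was added after line 3 of the first. Lines prefixed with < come from the first file, and lines prefixed with > come from the second.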
Part 2
Output redirection and combining
different commands
Command output
• By default, the standard output of any command
appears on the terminal screen.
• The output can instead be redirected to a file.
• This is particularly useful for big files.
• Syntax: command options filename.in > filename.out
Output redirection
• If the file exists, its previous content is overwritten
with the result.
• If the file does not exist, it will be
automatically created and the result
redirected to it.
$ cat ghandi.txt
The difference between what we do
and what we are capable of doing
would suffice to solve
most of the world's problems
$ cut -d' ' -f2,3 ghandi.txt
difference between
what we
suffice to
of the
$ cut -d' ' -f2,3 ghandi.txt > ghandi.txt.out
$ cat ghandi.txt.out
difference between
what we
suffice to
of the
Combining commands
• Every command produces a single
standard output
• As seen previously, this output can be printed on the
screen or redirected to a file
• However, the output of a command can also be
redirected to another command
• This is particularly useful when several operations are
needed on a file, with no need to store the
intermediate outputs
Combining commands: example
• Several commands are combined using
the “|” (pipe) character
• Structure:
command1 options1 filename1.in | command2 options2 > filename.out
• This can be done for as many commands as needed
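A pipeline of this form can chain the commands from Part 1. The sketch below is one possible combination, assuming a made-up file samples.txt with a sample ID and an organism per line. It counts how often each organism occurs and saves only the final result.

```shell
# Sample ID and organism, separated by a space (hypothetical data).
printf 'S1 human\nS2 mouse\nS3 human\nS4 rat\nS5 human\n' > samples.txt

# cut keeps column 2, sort brings identical values together,
# uniq -c counts them, and sort -nr puts the highest count first.
# No intermediate file is written; only the last output is redirected.
cut -d ' ' -f 2 samples.txt | sort | uniq -c | sort -nr > organism_counts.txt

cat organism_counts.txt
```

The first line of the result is the most frequent organism, human, with a count of 3. Each stage reads the previous stage's standard output, so only the first command needs a file name.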
Thanks
Shaun Aron & Sumir Panji