
Efficiently grouping a SAM/BAM file by reference name using Awk

Posted on: September 27, 2023 at 12:13 PM

Target audience

A quick introduction to awk

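awk is a program for efficiently slicing and dicing structured tabular data. By default,

* awk reads and processes its input one record at a time, where a record is a line
* each record is split into fields on whitespace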

These two points are enough to get you started with awk. There are a couple of other things you should know too. I’ll use examples to demonstrate them.
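Here is a minimal sketch of awk's record-by-record, field-by-field behaviour. It uses made-up whitespace-separated input rather than a real SAM file. NR and NF are awk's built-in record counter and field count; describe_records is a hypothetical helper name used only for this illustration.

```shell
# Hypothetical helper: report, for every input line, its record number,
# how many fields awk found, and the first field.
describe_records() {
  awk '{ print "record " NR " has " NF " fields; first field: " $1 }'
}

# Three made-up lines stand in for real input.
printf 'a b c\nd e\nf g h i\n' | describe_records
```

Each input line produces one output line, which is the "one record at a time" behaviour in practice.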

Example

Input

Let's use the following SAM file as input. Let's call it sample.sam.

@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
r001    163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   XX:B:S,12561,2,20,112
r002    0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0   AAAAGATAAGGGATAAA   *
r001    83  ref 37  30  9M  =   7   -39 CAGCGCCAT   *
x1  0   ref2    1   30  20M *   0   0   aggttttataaaacaaataa    ????????????????????
x2  0   ref2    2   30  21M *   0   0   ggttttataaaacaaataatt   ?????????????????????
x3  0   ref2    6   30  9M4I13M *   0   0   ttataaaacAAATaattaagtctaca  ??????????????????????????

Output

We expect our script to create two files:

* ref.sam, which contains all the records where the reference name is ref
* ref2.sam, which contains all the records where the reference name is ref2

Minimal hands on with awk

As mentioned earlier, awk reads and processes one record at a time. For each record, you use $N to refer to the Nth field of that record.

Say you want to extract the CIGAR string of every record in the above SAM file. The CIGAR string is the sixth column, so you’ll use $6. Here’s a snippet that does this:

samtools view sample.sam | awk -F'\t' '{print $6}' > all-cigars

The above snippet extracts all the CIGAR strings and puts them in a file named all-cigars.
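If samtools is not at hand and the input is already a plain-text SAM file, a rough equivalent is to let awk skip the header lines itself. This is only a sketch under two assumptions: header lines start with @ (as the @SQ lines in the sample do) and the fields are tab-separated. cigars_from_sam is a hypothetical helper name and the records are made up.

```shell
# Hypothetical helper: print the CIGAR column ($6) of a plain-text SAM file,
# skipping header lines, which begin with '@'.
cigars_from_sam() {
  awk -F'\t' '!/^@/ { print $6 }' "$1"
}

# Made-up input: one header line plus two tab-separated records.
tmp_sam=$(mktemp)
printf '@SQ\tSN:ref\tLN:45\n' > "$tmp_sam"
printf 'r001\t163\tref\t7\t30\t8M4I4M1D3M\n' >> "$tmp_sam"
printf 'r003\t83\tref\t37\t30\t9M\n' >> "$tmp_sam"

cigars_from_sam "$tmp_sam"
```

The pattern !/^@/ in front of the braces restricts the action to lines that do not start with @, so only alignment records reach print.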

Another important thing

As mentioned above, $1 refers to the first column, $2 to the second column, and so on.

Similarly, $0 refers to the whole record.
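To make the difference between $N and $0 concrete, here is a small sketch run on a single made-up, tab-separated record that carries only the first six SAM columns. show_fields is a hypothetical helper name.

```shell
# Hypothetical helper: show the reference name ($3), the CIGAR string ($6)
# and the untouched input line ($0) for each tab-separated record.
show_fields() {
  awk -F'\t' '{ print "ref: " $3; print "cigar: " $6; print "whole: " $0 }'
}

# One made-up record.
printf 'r001\t163\tref\t7\t30\t8M4I4M1D3M\n' | show_fields
```

The last output line repeats the input exactly, tabs included, because $0 is the record as awk read it.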

Final Code

Tested on Ubuntu

export samfile="sample.sam" # Replace sample.sam with the name of your SAM/BAM file.
# The parentheses make every awk implementation treat $3 ".sam" as the output file name.
samtools view "${samfile}" | awk -F'\t' '{print $0 > ($3 ".sam")}'

Explanation

Ultimately, we’re simply writing $0, which represents the whole record (a whole line in this case), to a file whose name is the value of $3 followed by .sam. Hence,

* All the records where $3 is ref will be written to a file named ref.sam
* All the records where $3 is ref2 will be written to a file named ref2.sam
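The grouping can be tried without samtools on a few made-up, tab-separated records. The sketch below assumes column 3 holds the reference name and writes into a scratch directory. The file name expression is parenthesized because an unparenthesized concatenation after > is not parsed the same way by every awk implementation.

```shell
# Scratch directory so the generated files are easy to inspect and remove.
workdir=$(mktemp -d)

# Made-up records; only the first four columns are present.
printf 'r001\t163\tref\t7\nx1\t0\tref2\t1\nr002\t0\tref\t9\n' > "$workdir/records.tsv"

# Write each record ($0) to a file named after its third column.
( cd "$workdir" && awk -F'\t' '{ print $0 > ($3 ".sam") }' records.tsv )

ls "$workdir"
cat "$workdir/ref.sam"
```

After the run, ref.sam holds the two records whose third column is ref, and ref2.sam holds the remaining one.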

Benchmarks