Linux snippets: sorting a file, ignoring the header

When working with large data files that have a header, it is sometimes more efficient to sort the files before evaluation so that a streaming algorithm can be used. You may also simply want to sort the data by some key for organization and readability. Either way, a lot of data preparation involves doing something with the data in a delimited file while preserving the position and contents of its header.

Here is a short example that sorts a tab-delimited file with a header by the first field in the file:

(head -n 1 data.tsv && tail -n +2 data.tsv | sort -t$'\t' -k1,1) > data_sorted.tsv

What this command does is spawn a subshell that runs everything in the parentheses and redirects the combined output to a second file. Within the parentheses, we first print the header (head -n 1). Then we run another command that takes everything except the header (tail -n +2, which starts output at line 2) and pipes it to the sort utility. The arguments to sort include a delimiter (-t$'\t', which specifies tab; the $'\t' quoting works in bash, ksh, and zsh, and in other shells you can type a literal tab inside single quotes by pressing Ctrl-V followed by Tab) and the field to sort by (-k1,1, the first field in this case; a bare -k1 would use everything from the first field to the end of the line as the key). You could substitute whatever routine you want for sort.
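
If you find yourself doing this often, the same idea can be wrapped in a small shell function. The following is only a sketch: the function name sort_keep_header, its argument order, and the sample data are made up for illustration, and it assumes a POSIX-style shell with the standard head, tail, and sort utilities.

```shell
# Hypothetical helper (name and arguments are illustrative, not a standard tool):
# print the header line unchanged, then the remaining lines sorted by one field.
# Usage: sort_keep_header FILE FIELD [DELIMITER]
sort_keep_header() {
    skh_file=$1
    skh_field=$2
    # Default delimiter is a single tab; printf keeps this portable to plain sh.
    skh_delim=${3:-$(printf '\t')}

    head -n 1 "$skh_file"
    tail -n +2 "$skh_file" | sort -t "$skh_delim" -k "${skh_field},${skh_field}"
}

# Demo with a throwaway file (sample data invented for this example).
demo_file=$(mktemp)
printf 'fruit\tcount\npear\t3\napple\t7\nfig\t1\n' > "$demo_file"

# Sort by the first field; the header line "fruit<TAB>count" stays on top.
sort_keep_header "$demo_file" 1

rm -f "$demo_file"
```

Because the function writes to standard output, you can redirect it to a new file just like the one-liner, for example: sort_keep_header data.tsv 1 > data_sorted.tsv. Passing a third argument such as ',' would handle a comma-delimited file, as long as no field contains the delimiter itself.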