Have you ever needed to quickly process a text file into some other format… say HTML? Well, I run across this problem more frequently than you might suspect and there is one tool that I use quite often to solve the problem.
AWK is a general purpose programming language that is designed for processing text-based data, either in files or data streams. The name AWK is derived from the surnames of its authors — Alfred Aho, Peter Weinberger, and Brian Kernighan; however, it is not commonly pronounced as a string of separate letters but rather to sound the same as the name of the bird, auk (which acts as an emblem of the language such as on The AWK Programming Language book cover).
Disclaimer: Quoted from Wikipedia
In this HOW-TO I will be discussing a very simple script which converts book data from a tab delimited text format into an HTML file. The data will be presented in tabular form in the output file.
The Code
The BEGIN section of the AWK program runs, as you might have guessed, at the beginning of processing your file or stream. In this section I set the FS system variable to "\t", which is the escape sequence for a TAB character. The FS system variable stands for field separator; this tells the AWK program that my fields are separated by TABs. After this I simply use the print command to output the first parts of the HTML file that I will be generating.
BEGIN {
    # Set the field separator to tab
    FS = "\t";
    # Output the HTML header
    print "<html>";
    print "<head>";
    ...
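To see what setting FS does in isolation, here is a minimal sketch you can paste into a shell. The sample record is made up for illustration and is not taken from Books.txt.

```shell
# A made-up tab-delimited record (title, author, year); not from Books.txt
line=$(printf 'Dune\tFrank Herbert\t1965')

# With FS set to a tab, $2 is the whole second column, spaces included
with_tab=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "\t" } { print $2 }')

# With the default FS, awk splits on any run of blanks, so $2 is a single word
default_fs=$(printf '%s\n' "$line" | awk '{ print $2 }')

echo "tab FS:     $with_tab"
echo "default FS: $default_fs"
```

The difference matters for data like book titles and author names, which usually contain spaces.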
The body of the program (the middle set of curly brackets) contains the real meat of the processing. When I exported my tab-delimited text I included the column headers as the first line of text. This bit of code uses the NR system variable (the number of records read so far) to set my cell type to TH for the first row and to TD for each subsequent row.
if (NR == 1)
    cellType = "th";
else
    cellType = "td";
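As a quick check of how NR drives the cell type, the following sketch feeds three hypothetical lines (one header, two data rows) through the same test and prints NR alongside each generated cell. The input values are assumptions made for the example.

```shell
# Hypothetical three-line input: one header row followed by two data rows
cells=$(printf 'Title\nDune\nEmma\n' | awk '{
    # NR is 1 only while the first record is being processed
    if (NR == 1) cellType = "th"; else cellType = "td";
    print NR ": <" cellType ">" $0 "</" cellType ">"
}')

echo "$cells"
```

Only the first line should come out wrapped in th tags; every later line gets td.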
Now I use a for loop that goes from 1 (the first field of data) to NF, the system variable for the number of fields in the current record. When I exported my tab-delimited text some of the fields ended up being quoted because they contained commas. I do not want quotes in my output, so I need to strip these off. I use the match function to test the field against a regular expression: ^\" matches a quote at the beginning of the data and \"$ matches a quote at the end of the data. If both of these conditions are met then I simply remove those quotes from the string. Then I print out the table cell containing the data.
# Loop over the fields
for (i = 1; i <= NF; i++) {
    # Some of our fields are quoted so we want to trim those
    myData = $i;
    if (match(myData, "^\"") && match(myData, "\"$")) {
        sub(/^"/, "", myData);
        sub(/"$/, "", myData);
    }
    # Print the cell
    print " <" cellType ">" myData "</" cellType ">";
}
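Here is a small self-contained sketch of the quote stripping on a single made-up record. It uses the ~ operator rather than match, which performs the same test in this case; the sample data is an assumption for illustration.

```shell
# Made-up record whose second field was quoted by the export because it contains a comma
row=$(printf 'Emma\t"Austen, Jane"\t1815\n' | awk 'BEGIN { FS = "\t" } {
    for (i = 1; i <= NF; i++) {
        myData = $i
        # Strip the quotes only when the field both starts and ends with one
        if (myData ~ /^"/ && myData ~ /"$/) {
            sub(/^"/, "", myData)
            sub(/"$/, "", myData)
        }
        print "<td>" myData "</td>"
    }
}')

echo "$row"
```

Note that the comma inside the author field survives untouched; only the surrounding quotes are removed.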
Finally, in the END section I print out the closing of the HTML file. You could also use this section to output statistics such as the number of rows processed, or some aggregated data that you had calculated in the body of the program.
END {
    # Output the HTML footer
    print "\r\n";
}
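Putting the three sections together, a rough end-to-end sketch might look like the following. The listings in this article show only part of the header and footer, so the exact tags printed in BEGIN and END here (body, table, tr) are my assumptions, as are the sample rows. The row count printed in END is one example of the kind of statistic mentioned above.

```shell
# Hypothetical sample input: a header row plus two data rows
html=$(printf 'Title\tAuthor\nDune\tFrank Herbert\nEmma\t"Austen, Jane"\n' | awk '
BEGIN {
    FS = "\t"
    # Assumed header tags; the full original header is not shown in the article
    print "<html>"
    print "<body>"
    print "<table>"
}
{
    if (NR == 1) cellType = "th"; else cellType = "td"
    print "  <tr>"
    for (i = 1; i <= NF; i++) {
        myData = $i
        if (myData ~ /^"/ && myData ~ /"$/) {
            sub(/^"/, "", myData)
            sub(/"$/, "", myData)
        }
        print "    <" cellType ">" myData "</" cellType ">"
    }
    print "  </tr>"
}
END {
    # Assumed footer tags mirroring the assumed header
    print "</table>"
    # NR still holds the total record count in END; subtract the header row
    print "<p>" (NR - 1) " books listed</p>"
    print "</body>"
    print "</html>"
}')

echo "$html"
```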
Conclusion / Download
I hope that this has helped you to see the possibilities of the AWK language for very quickly processing text files. For instance, have you ever wanted to insert a file like this into a database, but the database you are using doesn’t support an import feature? This script could very easily be modified to create a SQL file containing a bunch of INSERT statements. Or perhaps you need an XML version of this data?
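As a rough idea of what the SQL variant could look like, here is a minimal sketch. The table name books, the column names title and author, and the sample rows are all made up for this example; a real file would need its own column list. The single quote is passed in through the q variable so it does not clash with the shell quoting around the AWK program.

```shell
# Hypothetical input: header row plus two data rows (title, author)
sql=$(printf 'Title\tAuthor\nDune\tFrank Herbert\nEmma\t"Austen, Jane"\n' |
awk -v q="'" 'BEGIN { FS = "\t" }
NR > 1 {
    # The header row (NR == 1) is skipped by the pattern above
    values = ""
    for (i = 1; i <= NF; i++) {
        v = $i
        if (v ~ /^"/ && v ~ /"$/) { sub(/^"/, "", v); sub(/"$/, "", v) }
        gsub(q, q q, v)              # double any single quotes for SQL
        sep = (i > 1) ? ", " : ""
        values = values sep q v q
    }
    # "books", "title" and "author" are made-up names for this sketch
    print "INSERT INTO books (title, author) VALUES (" values ");"
}')

echo "$sql"
```

Doubling single quotes is only a basic form of escaping, so treat the generated file as something to review before loading it into a database.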
Here is a download of the complete AWK script and the tab-delimited data which I used to test it. To run it, simply unzip both files into a directory and execute the command:
> awk -f TabToHtml.awk Books.txt > output.html
Happy awk-ing!
NOTE: This tutorial was written as part of the [GAS] Ultimate How To Contest. They are giving away over $1300 in prizes so head over to see how you can get in the running!