The Linux Rain Linux General/Gaming News, Reviews and Tutorials

Splitting a File Elegantly

By Bob Mesibov, published 29/08/2014


In a previous Linux Rain article I compared different ways to delete blank lines, and showed that the AWK way was the simplest and most thorough. Here I show how to split a text file into multiple text files using a surprisingly simple AWK command.

Background

In 2013 I published an article in Free Software Magazine about finding changes in a sorted list. The example I used is shown below. It's a made-up name/address/phone-number file called names, with comma-separated fields.

[Screenshot 1: the names file, with the change points marked by red lines]

The aim in that article was to identify the lines just before (or just after) a change occurred in the last name of the people listed. Those change points are shown above as red lines.

Read the original article to see the trickery I used to find the change points, but it's what might happen next that got me thinking. One of the next things to do would be to split the file into multiple files based on last name. In the FSM article, I did this by marking the change points with a blank line, then using this slightly terrifying AWK command to split the resulting file:

[Screenshot 2: the AWK command used in the FSM article]
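As a rough illustration of the blank-line approach (this is not the FSM command), AWK's paragraph mode can treat each blank-line-separated block as one record. The sketch assumes a hypothetical file called names_marked, made-up sample data, and that the last name is the first comma-separated field.

```shell
# Work in a scratch directory with a tiny made-up sample (hypothetical data).
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\n' \
  'Jones,Bill,3 Low Rd,5550102' \
  '' \
  'Smith,Ann,12 High St,5550101' \
  'Smith,Carl,9 Mill Ln,5550103' > names_marked

# RS = "" puts AWK in paragraph mode: each blank-line-separated block is one record.
# $1 is then the last name on the first line of the block, used here as the filename.
awk 'BEGIN { RS = ""; FS = "," } { print > $1 }' names_marked
```

This only works if the blank-line markers are already in place, which in turn depends on the file being sorted.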

But what if the original file wasn't sorted by last name? Is there a simpler way to split the file by last name using AWK, without sorting?

Sheer simplicity

Yes, there is, and it works with both sorted and unsorted lists. It's a clever idea, but not mine — I found it in a response by 'terdon' in a Stack Exchange post.

All you need to do is tell AWK to store the last name in a variable, then append the lines beginning with that name to a new file, naming each new file with the appropriate last name:

[Screenshot 3: the AWK splitting command]
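A minimal sketch of that idea follows. It is not necessarily the exact command used in the article: the variable name lastname and the sample rows are made up, and the last name is assumed to be the first comma-separated field.

```shell
# Work in a scratch directory with a tiny made-up, deliberately unsorted sample.
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\n' \
  'Smith,Ann,12 High St,5550101' \
  'Jones,Bill,3 Low Rd,5550102' \
  'Smith,Carl,9 Mill Ln,5550103' > names

# Store the last name (field 1) in a variable, then print the whole line
# to a file of that name. Within one AWK run, ">" empties the file only
# when it is first opened; later prints to the same name are appended.
awk -F, '{ lastname = $1; print > lastname }' names

ls    # lists Jones, Smith and the original names file
```

Because each line is routed by its own first field, the order of the input lines doesn't matter; both Smith lines end up in the file Smith even though a Jones line sits between them.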

Watch this command at work on a shuffled version of names:

[Screenshots 4 and 5: the command at work on a shuffled version of names]

What about big files?

Let's try some heavier-duty processing. I'll use the Australian gazetteer file I described in this Linux Rain article. My current version has 374032 lines. Each line has 6 items separated by tabs: (1) placename code, (2) placename, (3) placename type, (4) Australian State or Territory, (5) latitude and (6) longitude. Here is the number of lines for each Australian State or Territory (or other division):

[Screenshot 6: number of lines for each State, Territory or other division]
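One way such a tally could be produced with AWK is sketched below. The rows are made-up stand-ins in the 6-field layout described above, not real gazetteer entries, and the command is an assumption rather than the one used in the article.

```shell
# Work in a scratch directory with made-up rows (hypothetical data).
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'c1' 'Alpha Creek' 'STRM' 'TAS' '-42.0' '147.0' \
  'c2' 'Beta Hill'   'HILL' 'WA'  '-32.0' '116.0' \
  'c3' 'Gamma Bay'   'BAY'  'TAS' '-41.0' '148.0' \
  'c4' 'Delta Reef'  'REEF' 'N/A' '-10.0' '142.0' > Oz_gazetteer

# Tally field 4 in an array, then print each division with its line count.
# The order of "for (d in count)" is unspecified, so the output is piped to sort.
awk -F'\t' '{ count[$4]++ } END { for (d in count) print d "\t" count[d] }' Oz_gazetteer | sort
```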

We'll do the AWK splitting on these divisions, but there's a gotcha. The entry N/A is going to throw up an error message, because the '/' character isn't allowed in Linux filenames, and we can't name a file 'N/A'. I'll fix that by replacing 'N/A' with uncertain:

[Screenshot 7: replacing 'N/A' with 'uncertain']
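The replacement could be done in AWK itself, as in the sketch below. This is an assumption about one workable approach, not the article's own command; the output name Oz_gazetteer_fixed and the sample rows are made up.

```shell
# Work in a scratch directory with made-up rows (hypothetical data).
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'c1' 'Alpha Creek' 'STRM' 'TAS' '-42.0' '147.0' \
  'c4' 'Delta Reef'  'REEF' 'N/A' '-10.0' '142.0' > Oz_gazetteer

# Rewrite field 4 only where it is exactly "N/A"; OFS keeps the tabs
# when AWK rebuilds the modified line. The trailing 1 prints every line.
awk 'BEGIN { FS = OFS = "\t" } $4 == "N/A" { $4 = "uncertain" } 1' Oz_gazetteer > Oz_gazetteer_fixed
```

Testing field 4 for an exact match avoids touching any 'N/A' string that might appear in another field. Once the result has been checked, the fixed file can be moved over the original with mv.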

Now to push the 'Split' button:

[Screenshot 8: the AWK split command run on Oz_gazetteer]
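A sketch of the same split applied to field 4 of a tab-separated file, assuming 'N/A' has already been replaced with 'uncertain'. The rows are made up and the command is not necessarily the one used in the article.

```shell
# Work in a scratch directory with made-up rows (hypothetical data).
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'c1' 'Alpha Creek' 'STRM' 'TAS'       '-42.0' '147.0' \
  'c2' 'Beta Hill'   'HILL' 'WA'        '-32.0' '116.0' \
  'c3' 'Gamma Bay'   'BAY'  'TAS'       '-41.0' '148.0' \
  'c4' 'Delta Reef'  'REEF' 'uncertain' '-10.0' '142.0' > Oz_gazetteer

# Send each line to a file named after its State/Territory field (field 4).
awk -F'\t' '{ print > $4 }' Oz_gazetteer
```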

The result is one file for each division, with just the lines for that division extracted from Oz_gazetteer (compare the screenshot above):

[Screenshot 9: the split-out files, one per division]

And here's a quick validity check, comparing 10 lines from a split-out file with the originals in Oz_gazetteer:

[Screenshot 10: validity check on 10 lines]
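A different, whole-file check is also possible. The sketch below (made-up rows, hypothetical commands rather than the article's) confirms that every line landed in the file matching its field 4, and that no lines were lost or duplicated.

```shell
# Work in a scratch directory with made-up rows, then split on field 4.
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'c1' 'Alpha Creek' 'STRM' 'TAS'       '-42.0' '147.0' \
  'c2' 'Beta Hill'   'HILL' 'WA'        '-32.0' '116.0' \
  'c3' 'Gamma Bay'   'BAY'  'TAS'       '-41.0' '148.0' \
  'c4' 'Delta Reef'  'REEF' 'uncertain' '-10.0' '142.0' > Oz_gazetteer
awk -F'\t' '{ print > $4 }' Oz_gazetteer

# Check 1: every line in a split-out file should carry that file's name in field 4.
awk -F'\t' '$4 != FILENAME { bad++ } END { print bad + 0, "misplaced lines" }' TAS WA uncertain

# Check 2: the split-out files together should hold as many lines as the original.
cat TAS WA uncertain | awk 'END { print NR, "lines in split files" }'
awk 'END { print NR, "lines in Oz_gazetteer" }' Oz_gazetteer
```

With this sample the first check reports 0 misplaced lines, and both counts in the second check are 4.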

Another gotcha

I have a big tab-separated file called 'events' in which field 17 is a year. Can I split events into year files? Yes, but. There are 117 lines in which the year field is blank:

[Screenshot 11: counting the lines with a blank year field]
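Counting blank entries in one field could be done as sketched below. The tiny events file built here is a made-up stand-in (a header plus four rows, with filler values in fields 1 to 16), and the counting command is an assumption, not necessarily what was run in the article.

```shell
# Work in a scratch directory and build a made-up 17-field, tab-separated file.
workdir=$(mktemp -d) && cd "$workdir"
awk 'BEGIN {
    n = split("year,2013,2014,,2014", years, ",")   # header, then one blank year
    for (i = 1; i <= n; i++) {
        line = ""
        for (j = 1; j <= 16; j++) line = line "f" j "\t"   # 16 filler fields
        print line years[i]                                 # field 17 is the year
    }
}' > events

# Skip the header (NR > 1) and count the lines whose field 17 is empty.
awk -F'\t' 'NR > 1 && $17 == "" { n++ } END { print n + 0 }' events
```

The n + 0 in the END block makes AWK print 0 rather than an empty line when no blanks are found.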

If I use the AWK splitter and try to name the split-out files by year, the lines with a blank year would have to go to a file with an empty name, and that's not possible. What I'll do instead is name the split-out files year_[something]. The lines with blank years will go to the file year_, and this year's lines (for example) will go to year_2014. In the following AWK command, I've avoided the header line by using the condition 'NR>1':

[Screenshot 12: the AWK command splitting events into year_ files]
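A sketch of a prefixed split along those lines, using a made-up stand-in for events (a header plus four rows, one with a blank year). The command is an illustration under those assumptions, not necessarily the one used in the article.

```shell
# Work in a scratch directory and build a made-up 17-field, tab-separated file.
workdir=$(mktemp -d) && cd "$workdir"
awk 'BEGIN {
    n = split("year,2013,2014,,2014", years, ",")   # header, then one blank year
    for (i = 1; i <= n; i++) {
        line = ""
        for (j = 1; j <= 16; j++) line = line "f" j "\t"   # 16 filler fields
        print line years[i]                                 # field 17 is the year
    }
}' > events

# Skip the header, then build each filename from the prefix plus field 17.
# The parentheses make sure the whole concatenation is taken as the filename.
awk -F'\t' 'NR > 1 { print > ("year_" $17) }' events
```

With this sample, the result is year_2013 (one line), year_2014 (two lines) and year_ (the one line with a blank year).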

Final note

I'm not 100% sure this elegant command will work with all versions of AWK. Mine is GNU Awk 4.0.1.
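One known portability concern is that some AWK implementations limit how many output files can be open at once, which could matter when a split produces many files. A more conservative sketch (made-up sample data; the variable name out is hypothetical) closes each file straight after writing to it:

```shell
# Work in a scratch directory with a tiny made-up sample (hypothetical data).
workdir=$(mktemp -d) && cd "$workdir"
printf '%s\n' \
  'Smith,Ann,12 High St,5550101' \
  'Jones,Bill,3 Low Rd,5550102' \
  'Smith,Carl,9 Mill Ln,5550103' > names

awk -F, '{
    out = $1
    print >> out    # append, because the file is reopened for every line
    close(out)      # release the file handle before moving to the next line
}' names
```

Because '>>' appends to whatever is already there, any output files left over from an earlier run should be deleted first. Reopening a file for every line is also likely to be slower on big inputs.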


About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

Tags: linux tutorial scripting shell bash awk files splitting