Text justification (ASCII)

January 15, 2007

On my hosting server (or should I say my SDF Unix shell account) there is a nice bboard, one of whose groups is “HELPDESK”. People come and go there, asking for and receiving help. It is a helpful and amiable place, where I have learnt a lot (and which has also helped me refresh forgotten ideas or solutions to old problems).

About a week ago, someone asked there for a Unix utility to word-wrap files. For whatever reason (probably lack of attention) I understood he was looking for a justification utility, that is, a script to reformat paragraphs so that all the lines have the same length. You certainly know what that means, but Wikipedia explains it, just in case. The point is: I decided to write one, as an interesting problem in simple functional programming. I also wanted to include several options (like not justifying lines ending in “dots”, or choosing how the extra spaces are distributed in the lines: rightwards, leftwards or randomly, etc.). I came up with this (which includes a lorem ipsum text for demo purposes and a little “joiner” program as a plus).

In this entry I comment on the main loop and the justification subroutine.

The loop is (I have taken away the code for special cases):

     1  while(<>) {
     2      chomp ;
     3      $line .= $_ ;
     4      while($line) {
     5          if ($start != 1) {
     6              print $PREPEND ;
     7          }
     8          ($output, $line) = justify($start, $line);
     9          # ONLY PRINT THE OUTPUT if the line goes on, otherwise,
    10          # we need to adjust the loose line, in case it has to be justified.
    11          if ($line) {
    12              print $output ;
    13              $start = 0;
    14          }
    15      }
    16      # chomp the last part of the line and process it again, otherwise,
    17      # loose lines were always printed verbatim (which is
    18      # not necessarily [...]
    20      [...] justify($start, $output ) ;
    21      print $output ;
    22

It is quite simple, as you see. The only “idea” in it is to have the justification routine return not just a line, but both the line and the outstanding text. This makes it possible to loop on $line (line 4 in the code above): the call in line 8 gets the true “output” for the present line and sets $line to what remains to be formatted, and the loop repeats until nothing remains. Finally, the last piece of output *needs* to be processed (it is what is called a “loose line”, and the user may or may not want it justified, according to the command-line parameters).

The “justify” routine takes a long line as input (and a flag telling it whether it is parsing the start of a paragraph or not) and returns two strings: a completely justified line and the outstanding text (there is some special code to deal with loose lines and with several different user preferences).
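The "return the line and the remainder" pattern can be tried in isolation. What follows is only a minimal sketch, not the script's code: take_line is a hypothetical stand-in for justify that merely cuts at the last space that fits (no padding), wrap_all is a hypothetical driver, and the width and sample text are made up for the demo. It assumes the text is already single-spaced with no leading space.

```perl
use strict;
use warnings;

# Hypothetical stand-in for justify(): returns the part of $text that fits
# in $width columns, plus the text still waiting to be formatted.
sub take_line {
    my ($width, $text) = @_;
    return ($text, "") if length($text) <= $width;
    # Last space at or before column $width; if there is none, the first
    # word alone is too long, so cut at the first space after it instead.
    my $cut = rindex($text, " ", $width);
    $cut = index($text, " ", $width) if $cut <= 0;
    return ($text, "") if $cut < 0;    # one long word, nothing to split
    return (substr($text, 0, $cut), substr($text, $cut + 1));
}

# Keep feeding the remainder back in until nothing is left.
sub wrap_all {
    my ($width, $text) = @_;
    my @lines;
    while (length $text) {
        my $out;
        ($out, $text) = take_line($width, $text);
        push @lines, $out;
    }
    return @lines;
}

print "$_\n" for wrap_all(12, "the quick brown fox jumps over the lazy dog");
```

With a width of 12 this prints four lines: "the quick", "brown fox", "jumps over" and "the lazy dog". The sketch tests length $text rather than the truth of the string, so a remainder consisting of just "0" would not end the loop early.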

Interesting chunks are:

    # inside the "justify()" subroutine
    #
    $local_line = join(" ", my @words = split(/\s+/, $local_line)) ;

In a single line this collapses every run of whitespace in $local_line (a copy of the input line of text) and drops the trailing spaces [this is done with split], saves into @words the list of “words” (anything containing no whitespace) and joins all the words again, putting single spaces in between [join]. One caveat: with the /\s+/ pattern, a line that starts with whitespace yields an empty first field and so keeps one leading space; split(' ', ...) would discard that as well.
Then we pop words from the end of the @words list and prepend them to $overfull, which will contain the outstanding text, until what is left fits in the column width:

    # (...)
    while($#words and length ("@words") > $col_width) {
        $overfull = (pop @words) . " " . $overfull ;
    }
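To see the normalising and popping steps at work on a concrete line, here is a small sketch. The variable names mirror the snippets above, but the split_overflow wrapper is hypothetical, and it uses split ' ' instead of /\s+/ so that leading whitespace is discarded too.

```perl
use strict;
use warnings;

# Hypothetical wrapper: split a line into the words that fit in
# $col_width columns and the overflow text that must wait for the next line.
sub split_overflow {
    my ($col_width, $local_line) = @_;
    # split ' ' drops leading whitespace as well as repeated and trailing spaces.
    my @words = split ' ', $local_line;
    my $overfull = "";
    # $#words is the last index, so it is 0 when a single word is left:
    # a lone over-long word is never popped.
    while ($#words > 0 and length("@words") > $col_width) {
        $overfull = (pop @words) . " " . $overfull;
    }
    return (\@words, $overfull);
}

my ($kept, $overfull) = split_overflow(15, "  Lorem   ipsum dolor  sit amet ");
print "kept:     [@$kept]\n";
print "overfull: [$overfull]\n";
```

Here "Lorem ipsum" is kept and $overfull ends up as "dolor sit amet " (with a trailing space, since each popped word is followed by one); the next pass through split removes that space again.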

Then, if more than one word is left in @words (the exception being a line reduced to a single word, for instance ONE word in $local_line longer than the line width), distribute spaces as evenly as possible between words:

    # (...)
    if ($#words) {
        my $free_space = $col_width - length(join ("", @words)) ;
        $space = " " x int($free_space / $#words) ;

Keep the remaining spaces in @last_space, to be distributed later on according to the user’s preferences. The keys of the hash %spaces are exactly the places where these spaces will appear.

        @last_space = (" ") x ($free_space % $#words) ;
        %spaces = ();

        for(my $i = 0; $i <= $#last_space; $i++) {
            # distribute outstanding space according to user's prefs
            $spaces{$j} = " ";
        }
        my $i = 0 ;

Here the output is “written” word by word, inserting after each word as much space as the algorithm has computed. Notice that we do this for all the @words but the last one, which gets special treatment, as it may be the only one in the line.

        # join words + spaces
        foreach my $word (0..$#words - 1){
            $output .= $words[$word] . $space ;
            $output .= ($spaces{$i} ? pop @last_space : "" ) ;
            $i++ ;
        }
    }

Finally, append the last word to $output, preceded by whatever spaces are left in @last_space if $output is already non-empty, or on its own if there is no output yet.

    $output .= ($output ? "@last_space" . $words[$#words] :
                          $words[$#words] ) . "\n" ;
    return ($output, $overfull) ;
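The arithmetic behind the spacing is easy to check on a small case. The sketch below is not the routine above: pad_words and its $side option ("left" or "right") are hypothetical names, and it replaces the %spaces / @last_space bookkeeping with a direct comparison of gap indexes. It assumes the words already fit in $col_width with at least one space per gap.

```perl
use strict;
use warnings;

# Pad a list of words to exactly $col_width columns.
# $side says which gaps receive the spaces left over after the even split.
sub pad_words {
    my ($col_width, $side, @words) = @_;
    return join("", @words) if @words < 2;     # no gaps, nothing to distribute
    my $gaps       = $#words;                  # number of gaps between words
    my $free_space = $col_width - length(join "", @words);
    my $base       = int($free_space / $gaps); # spaces that every gap gets
    my $extra      = $free_space % $gaps;      # number of gaps getting one more
    my $output     = "";
    for my $i (0 .. $gaps - 1) {
        # The first $extra gaps ("left") or the last $extra gaps ("right")
        # are one space wider than the rest.
        my $bonus = $side eq "left" ? $i < $extra
                                    : $i >= $gaps - $extra;
        $output .= $words[$i] . " " x ($base + ($bonus ? 1 : 0));
    }
    return $output . $words[-1];
}

print pad_words(20, "left",  qw(a bb ccc dddd)), "|\n";
print pad_words(20, "right", qw(a bb ccc dddd)), "|\n";
```

In this example the four words take 10 columns, leaving 10 free for 3 gaps: every gap gets 3 spaces and one gap gets a fourth. Both printed lines are 20 columns wide and differ only in whether the first or the last gap is the wider one.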

This is all. Comments are welcome and remember, you can download the code and do as you please with it. But don’t blame me.