Word counting in the name of Zipf

Many, many years ago I read about Zipf's Law and thought it was something I had to see for myself. I wrote a word sorting/counting program and put gnuplot to work. I've kept the program alive for those occasions where I think of some other text profiling idea I want to try.
Never before has counting words been so much fun.

Why

What is this "Law of Zipf"?

Wikipedia knows more than I about Zipf's Law and is a good first place to go. However, speaking loosely, Zipf's law states that the frequency of occurrence of any word in a body of natural language is inversely proportional to its rank in the frequency table. Thus the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third, and so on. It's a 1/(r^b) relationship, where r is the rank and b tends to 1.

This type of distribution is found in a number of natural phenomena, such as earthquake magnitudes and moon crater sizes, and in many unnatural phenomena too, such as city sizes and corporation sizes. Usually plotted on a log-log graph, it shows a linear relationship between frequency and rank (ranked by size, count, value, etc.) of the data.

So one becomes curious

Why does this discrete power law distribution appear in spoken and written languages? The debate is ongoing. But first things first ... does this distribution actually occur? That second question is something I can check easily.

A preview example

So I wrote a program, wrapped it in a Tcl GUI, sorted some words, made some plots. I'll get to the details in the next section, but here is a preview and example of the kind of results one gets. For this example I used a topical text from http://www.apple.com/hotnews/thoughts-on-flash/

Statistics on the text file thoughtsonflash.txt, using the command:
zwc -s -a -d -l -o outfiles -f thoughtsonflash.txt
The total word count is 1698
The unique words number 569
The ratio unique/total is 33.5100 %
The highest ranking 55 words make up 50.236 % of the total, OR
The number of words used once 336
The ratio (words used once)/total is 19.7880 %
The average of the ratios between frequencies is 1.0085

# -----------------------------------------
# rank, count, word
# -----------------------------------------
1, 66, the
2, 64, and
3, 36, to
4, 35, flash
5, 29, is
6, 29, of
7, 27, for
8, 26, on
9, 23, we
10, 23, in
11, 21, a
12, 21, that
13, 20, adobe
14, 17, has
15, 17, they
16, 16, apple
17, 15, mobile
18, 15, devices
19, 14, our
20, 14, are
21, 14, it
22, 13, developers
23, 13, open
24, 12, platform
25, 12, web
26, 11, not
27, 11, their
28, 10, many
29, 10, all
30, 10, video
31, 9, third
32, 8, available
33, 8, party
34, 8, have
35, 8, by
36, 8, apps
37, 7, products
38, 7, with
39, 7, websites
40, 7, an
41, 7, from
42, 7, when
43, 7, but
44, 7, touch
45, 7, standards
46, 7, html5
47, 6, there
48, 6, iphones
49, 6, ipods
50, 6, ipads
51, 6, if
52, 6, using
53, 6, adobe's
54, 6, been
55, 6, this
56, 6, want
57, 6, any
58, 6, apple's
59, 6, will
60, 6, more
61, 5, uses
62, 5, h
63, 5, 264
64, 5, there's
65, 5, play
66, 5, be
67, 5, iphone
68, 5, ipad
69, 5, enhancements
70, 5, almost
71, 5, adopted
72, 5, adopt
73, 5, even
74, 5, best
75, 5, was
76, 5, create
77, 5, new
78, 4, browser
79, 4, webkit
80, 4, first
81, 4, half
82, 4, companies
83, 4, than
84, 4, other
85, 4, use
86, 4, proprietary
87, 4, support
88, 4, based
89, 4, software
90, 4, standard
91, 4, only
92, 4, cannot
93, 4, ipod
94, 4, example
95, 4, years
96, 4, now
97, 4, most
98, 4, can
99, 4, app
100, 4, games
101, 4, era
102, 4, pcs
103, 4, too
104, 3, system
105, 3, security
106, 3, 2009
107, 3, say
108, 3, second
109, 3, ship
110, 3, used
111, 3, google
112, 3, youtube
113, 3, battery
114, 3, videos
115, 3, without
116, 3, modern
117, 3, like
118, 3, need
119, 3, fact
120, 3, closed
121, 3, reason
122, 3, also
123, 3, experience
124, 3, one
125, 3, development
126, 3, cross
127, 3, platforms
128, 3, two
129, 3, fully
130, 3, mac
131, 3, them
132, 3, because
133, 3, users
134, 3, why
135, 3, power
136, 3, content
137, 3, store
138, 3, created
139, 3, as
140, 3, 2010
141, 2, were
142, 2, together
143, 2, around
144, 2, creative
145, 2, joint
146, 2, customers
147, 2, since
148, 2, controlled
149, 2, source
150, 2, widely
151, 2, its
152, 2, technology
153, 2, full
154, 2, times
155, 2, others
156, 2, true
157, 2, entertainment
158, 2, titles
159, 2, performance
160, 2, these
161, 2, don't
162, 2, reliability
163, 2, well
164, 2, device
165, 2, few
166, 2, said
167, 2, smartphone
168, 2, then
169, 2, long
170, 2, life
171, 2, much
172, 2, every
173, 2, vimeo
174, 2, netflix
175, 2, recently
176, 2, decoder
177, 2, chips
178, 2, must
179, 2, while
180, 2, hours
181, 2, browsers
182, 2, safari
183, 2, which
184, 2, up
185, 2, over
186, 2, mouse
187, 2, css
188, 2, javascript
189, 2, would
190, 2, rewritten
191, 2, doesn't
192, 2, important
193, 2, do
194, 2, allow
195, 2, run
196, 2, know
197, 2, enhancement
198, 2, at
199, 2, may
200, 2, access
201, 2, set
202, 2, features
203, 2, tool
204, 2, goal
205, 2, help
206, 2, write
207, 2, although
208, 2, 10
209, 2, major
210, 2, developer
211, 2, os
212, 2, x
213, 2, advanced
214, 2, world
215, 2, ever
216, 2, seen
217, 2, so
218, 2, wider
219, 2, customer
220, 2, continually
221, 2, mice
222, 2, business
223, 2, understand
224, 2, beyond
225, 2, low
226, 2, where
227, 2, offering
228, 2, no
229, 2, or
230, 2, 000
231, 2, necessary
232, 2, applications
233, 2, perhaps
234, 2, should
235, 2, great
236, 2, tools
237, 2, future
238, 2, less
239, 1, relationship
240, 1, met
241, 1, founders
242, 1, proverbial
243, 1, garage
244, 1, big
245, 1, adopting
246, 1, postscript
247, 1, language
248, 1, laserwriter
249, 1, printer
250, 1, invested
251, 1, owned
252, 1, 20
253, 1, company
254, 1, worked
255, 1, closely
256, 1, pioneer
257, 1, desktop
258, 1, publishing
259, 1, good
260, 1, golden
261, 1, grown
262, 1, apart
263, 1, went
264, 1, through
265, 1, near
266, 1, death
267, 1, drawn
268, 1, corporate
269, 1, market
270, 1, acrobat
271, 1, today
272, 1, still
273, 1, work
274, 1, serve
275, 1, buy
276, 1, suite
277, 1, interests
278, 1, i
279, 1, wanted
280, 1, jot
281, 1, down
282, 1, some
283, 1, thoughts
284, 1, critics
285, 1, better
286, 1, characterized
287, 1, decision
288, 1, being
289, 1, primarily
290, 1, driven
291, 1, protect
292, 1, reality
293, 1, issues
294, 1, claims
295, 1, opposite
296, 1, let
297, 1, me
298, 1, explain
299, 1, 100
300, 1, sole
301, 1, authority
302, 1, pricing
303, 1, etc
304, 1, does
305, 1, mean
306, 1, entirely
307, 1, definition
308, 1, though
309, 1, operating
310, 1, strongly
311, 1, believe
312, 1, pertaining
313, 1, rather
314, 1, high
315, 1, implementations
316, 1, lets
317, 1, graphics
318, 1, typography
319, 1, animations
320, 1, transitions
321, 1, relying
322, 1, plug
323, 1, ins
324, 1, completely
325, 1, committee
326, 1, member
327, 1, creates
328, 1, began
329, 1, small
330, 1, project
331, 1, complete
332, 1, rendering
333, 1, engine
334, 1, heart
335, 1, android's
336, 1, palm
337, 1, nokia
338, 1, rim
339, 1, blackberry
340, 1, announced
341, 1, microsoft's
342, 1, making
343, 1, repeatedly
344, 1, 75
345, 1, what
346, 1, format
347, 1, viewable
348, 1, estimated
349, 1, 40
350, 1, web's
351, 1, shines
352, 1, bundled
353, 1, discovery
354, 1, viewing
355, 1, add
356, 1, facebook
357, 1, abc
358, 1, cbs
359, 1, cnn
360, 1, msnbc
361, 1, fox
362, 1, news
363, 1, espn
364, 1, npr
365, 1, time
366, 1, york
367, 1, wall
368, 1, street
369, 1, journal
370, 1, sports
371, 1, illustrated
372, 1, people
373, 1, national
374, 1, geographic
375, 1, aren't
376, 1, missing
377, 1, another
378, 1, claim
379, 1, fortunately
380, 1, 50
381, 1, free
382, 1, symantec
383, 1, highlighted
384, 1, having
385, 1, worst
386, 1, records
387, 1, hand
388, 1, number
389, 1, macs
390, 1, crash
391, 1, working
392, 1, fix
393, 1, problems
394, 1, persisted
395, 1, several
396, 1, reduce
397, 1, adding
398, 1, addition
399, 1, performed
400, 1, routinely
401, 1, asked
402, 1, show
403, 1, us
404, 1, performing
405, 1, never
406, 1, publicly
407, 1, early
408, 1, think
409, 1, eventually
410, 1, we're
411, 1, glad
412, 1, didn't
413, 1, hold
414, 1, breath
415, 1, who
416, 1, knows
417, 1, how
418, 1, perform
419, 1, fourth
420, 1, achieve
421, 1, playing
422, 1, decode
423, 1, hardware
424, 1, decoding
425, 1, contain
426, 1, called
427, 1, industry
428, 1, blu
429, 1, ray
430, 1, dvd
431, 1, player
432, 1, added
433, 1, currently
434, 1, requires
435, 1, older
436, 1, generation
437, 1, implemented
438, 1, difference
439, 1, striking
440, 1, decoded
441, 1, 5
442, 1, before
443, 1, drained
444, 1, re
445, 1, encode
446, 1, offer
447, 1, perfectly
448, 1, google's
449, 1, chrome
450, 1, plugins
451, 1, whatsoever
452, 1, look
453, 1, fifth
454, 1, designed
455, 1, screens
456, 1, fingers
457, 1, rely
458, 1, rollovers
459, 1, pop
460, 1, menus
461, 1, elements
462, 1, arrow
463, 1, hovers
464, 1, specific
465, 1, spot
466, 1, revolutionary
467, 1, multi
468, 1, interface
469, 1, concept
470, 1, rollover
471, 1, rewrite
472, 1, technologies
473, 1, ran
474, 1, solve
475, 1, problem
476, 1, sixth
477, 1, besides
478, 1, technical
479, 1, drawbacks
480, 1, discussed
481, 1, downsides
482, 1, interactive
483, 1, wants
484, 1, painful
485, 1, letting
486, 1, layer
487, 1, come
488, 1, between
489, 1, ultimately
490, 1, results
491, 1, sub
492, 1, hinders
493, 1, progress
494, 1, grow
495, 1, dependent
496, 1, libraries
497, 1, take
498, 1, advantage
499, 1, chooses
500, 1, mercy
501, 1, deciding
502, 1, make
503, 1, becomes
504, 1, worse
505, 1, supplying
506, 1, unless
507, 1, supported
508, 1, hence
509, 1, lowest
510, 1, common
511, 1, denominator
512, 1, again
513, 1, accept
514, 1, outcome
515, 1, blocked
516, 1, innovations
517, 1, competitor's
518, 1, painfully
519, 1, slow
520, 1, shipping
521, 1, just
522, 1, cocoa
523, 1, weeks
524, 1, ago
525, 1, shipped
526, 1, cs5
527, 1, last
528, 1, motivation
529, 1, simple
530, 1, provide
531, 1, innovative
532, 1, stand
533, 1, directly
534, 1, shoulders
535, 1, enhance
536, 1, amazing
537, 1, powerful
538, 1, fun
539, 1, useful
540, 1, everyone
541, 1, wins
542, 1, sell
543, 1, reach
544, 1, audience
545, 1, base
546, 1, delighted
547, 1, broadest
548, 1, selection
549, 1, conclusions
550, 1, during
551, 1, pc
552, 1, successful
553, 1, push
554, 1, about
555, 1, interfaces
556, 1, areas
557, 1, falls
558, 1, short
559, 1, avalanche
560, 1, media
561, 1, outlets
562, 1, demonstrates
563, 1, longer
564, 1, watch
565, 1, consume
566, 1, kind
567, 1, 200
568, 1, proves
569, 1, isn't
570, 1, tens
571, 1, thousands
572, 1, graphically
573, 1, rich
574, 1, including
575, 1, such
576, 1, win
577, 1, focus
578, 1, creating
579, 1, criticizing
580, 1, leaving
581, 1, past
582, 1, behind
583, 1, steve
584, 1, jobs
585, 1, april

You may notice that the frequency/rank relationship isn't as clearly visible as what may have been suggested by the loose description above. Here are the ratios of the word frequencies between adjacent ranks for the first several and final several entries:

1, 	66, 	the      66/64 =   1.031
2, 	64, 	and      64/36 =   1.778
3, 	36, 	to       36/35 =   1.029
4, 	35, 	flash    35/29 =   1.207
5, 	29, 	is       29/29 =   1.000
6, 	29, 	of       29/27 =   1.074
7, 	27, 	for      27/26 =   1.038
8, 	26, 	on
...
580, 	1, 	leaving   1/1  =   1.000
581, 	1, 	past      1/1  =   1.000
582, 	1, 	behind    1/1  =   1.000
583, 	1, 	steve     1/1  =   1.000
584, 	1, 	jobs      1/1  =   1.000
585, 	1, 	april

Averaging these ratios yields 1.0085, and the graph form above is shown again here along with the curve Ymax*rank^-1.0085 (in red). That's a power-law line that uses the average ratio between adjacent ranks' word frequencies as its exponent.
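The ratio calculation in the table can be sketched in a few lines of Python. This is only an illustration and not the zwc code: the function name adjacent_ratios is made up here, and the counts are the first eight from the thoughts-on-flash list.

```python
def adjacent_ratios(counts):
    """Return counts[i] / counts[i + 1] for each adjacent pair of ranks."""
    return [a / b for a, b in zip(counts, counts[1:])]

# Counts for ranks 1..8 of the thoughts-on-flash text, copied from the list.
top_counts = [66, 64, 36, 35, 29, 29, 27, 26]

ratios = adjacent_ratios(top_counts)
for rank, ratio in enumerate(ratios, start=1):
    print(f"rank {rank} / rank {rank + 1} = {ratio:.3f}")

# Average over these seven ratios only. The 1.0085 figure quoted in the
# text is the average over every adjacent pair in the full list.
partial_average = sum(ratios) / len(ratios)
print(f"average of the first {len(ratios)} ratios: {partial_average:.4f}")
```

The long tail of words used once contributes a long run of ratios equal to 1, which is what pulls the full average down so close to 1.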

[Here I chose Ymax = 220. In the graphs that follow, the value of Ymax will be the same as the word frequency for rank 1. The value of 220 here is somewhat arbitrary and chosen only for aesthetic reasons. Curve fitting is less fitting when the number of total words to count is small.]

What

So that was the "Why?" and I seldom need more than curiosity to look into something like this. In this section I will present more of the results, calling attention to some particular features and failures.

Some results

I chose 14 texts and ran the word count program and associated scripts on them. These texts were mostly from Project Gutenberg. The texts were stripped of introductions, editor's notes, historic background, etc., since I was primarily interested in finding out if different authors or different types of texts have a "signature" that is apparent from simple metrics like those shown at the top of the thoughts-on-flash example:

The ratio (unique words)/total ("unique words" is the same as the number of ranks).

The number of words making up 50% of the total volume of words.

The ratio (words used once)/total.

The average of the ratios between frequencies.
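A rough Python sketch of how these four metrics could be computed from a list of words follows. It is not the zwc implementation. The function name text_metrics and the dictionary keys are invented for illustration, and it assumes that "words making up 50%" means the fewest top-ranked words whose counts add up to at least half the total.

```python
from collections import Counter

def text_metrics(words):
    """Compute the four summary metrics from a non-empty list of words."""
    freqs = sorted(Counter(words).values(), reverse=True)  # counts by rank
    total = sum(freqs)
    unique = len(freqs)
    once = sum(1 for c in freqs if c == 1)

    # Assumption: count top-ranked words until they cover half the total.
    running = 0
    words_to_half = 0
    for c in freqs:
        running += c
        words_to_half += 1
        if 2 * running >= total:
            break

    ratios = [a / b for a, b in zip(freqs, freqs[1:])]
    return {
        "unique_over_total": unique / total,
        "words_to_half": words_to_half,
        "once_over_total": once / total,
        # None when there is only one rank, so no adjacent pair exists.
        "average_ratio": sum(ratios) / len(ratios) if ratios else None,
    }

metrics = text_metrics("the cat and the dog and the bird".split())
print(metrics)
```

For the eight-word sample, "the" and "and" together account for five of the eight words, so two words already cover half of the text.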

Raw results

This page provides some raw results as a collection of automatically generated HTML pages, with notes added. The 14 texts used were:

1musk12 The Three Musketeers by Alexandre Dumas.

AstroText Astro-Diagnosis A Guide To Healing by Max Heindel and Augusta Foss Heindel.

asyoulikeit As You Like It by William Shakespeare.

bible Holy Bible King James Version.

callw10 The Call of the Wild by Jack London.

cbook Thinking in C++, 2nd ed. Volume 1 by Bruce Eckel.

extraordinary Memoirs of Extraordinary Popular Delusions and the Madness of Crowds by Charles Mackay.

koran The Koran

MrFeynman "Surely You're Joking, Mr. Feynman!" by Richard P. Feynman.

nostradamus The Prophecies by Nostradamus.

olivertwist Oliver Twist by Charles Dickens.

origins The Origin of Species by means of Natural Selection, 6th Edition by Charles Darwin.

ovm.txt OVM User Guide, Version 2.1

rime.txt The Rime of the Ancient Mariner by Samuel Taylor Coleridge.

The top 40ish words

The page sidebyside.html shows a side-by-side comparison of the top ranking 42 words from each of the texts. It mostly consists of function words as expected, though there are particularly telling content words also in that top 42 list. I've put some of them in bold.

[The number 42 has no particular relevance¹. It's just where I cut the table length.]

Discussion and the other plots on these pages

What is a "word"?

To a first approximation, a "word" is everything which is separated by non-word tokens, where a non-word token is any of whitespace, comma, period, quote, semicolon, colon ... the things that normally separate words. It's not that simple though.

Are digits "words"?

Some of the texts have a large volume of digits. All kinds of digits. The verses from the bible are numbered. Some texts include page numbers. In those cases are digits to be included? I think not.
But always excluding digits is also problematic since it turns the H.264 video specification in http://www.apple.com/hotnews/thoughts-on-flash/ into 5 instances of "h" (I lowercase all words), 100% becomes a null string, and when Richard P. Feynman says:

Second, when you have a gear ratio, say 2 to 1, and you are
wondering whether you should make it 10 to 5 or 24 to 12 or
48 to 24, here's how to decide:

then stripping digits translates to:

Second, when you have a gear ratio, say to, and you are
wondering whether you should make it to or to or
to, here's how to decide:

which is nonsense.

My wordcount program has only a single digit handling option, -d, which puts digits on par with a-z,A-Z. Depending on the text, switching the digits option in and out can have a large effect on the total word count, which changes all the ratios that depend on total count.
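To make the effect of the digits option concrete, here is a small Python sketch of a tokenizer with a switch that plays the role of -d. It is an approximation under stated assumptions, not the zwc tokenizer: the name tokenize and the include_digits parameter are hypothetical, apostrophes are always treated as word characters, and everything else (hyphens included) is treated as a separator.

```python
import re

def tokenize(text, include_digits=False):
    """Split text into lowercased words.

    Letters and apostrophes are word characters. Digits are word
    characters only when include_digits is True (the role -d plays).
    """
    word_chars = "a-z0-9'" if include_digits else "a-z'"
    return re.findall(f"[{word_chars}]+", text.lower())

sentence = "say 2 to 1, and you should make it 10 to 5"

without_digits = tokenize(sentence)
with_digits = tokenize(sentence, include_digits=True)

print(len(without_digits), without_digits)
print(len(with_digits), with_digits)
print(tokenize("H.264"), tokenize("H.264", include_digits=True))
```

On this fragment the total goes from 8 words to 12 when digits are included, which shows how much the ratios based on total count can move.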

Hyphens and Hy-
phens

The program includes a hyphen handling option, -s. When a hyphen appears at the end of a line it could be that the word is being split by someone or some program in order to fit the line length.
The program always assumes this to be the case when a hyphen ends a line, and that hyphen is removed. The word becomes the portion ahead of the hyphen joined to everything up to the first non-word token of the next line. This means line 1 below is 4 words and line 2 is also 4 words, but one of the words in line 2 is now spelt wrong (sixweek). Sometimes it works out though. Lines 3 and 4 are equivalent.

1)
take a six-week vacation
2)
take a six-
week vacation
3)
so much hyper-
bole
4)
so much hyperbole

With the -s option selected all hyphens are removed and lines 1 and 2 are now the same, and both wrong. Some other method could be employed to correct this problem.
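The two hyphen behaviours described above can be sketched in Python as follows. The function names are made up for this illustration, and the regular expression is only one plausible way to express "a hyphen that ends a line"; the zwc source may do it differently.

```python
import re

def join_line_end_hyphens(text):
    """Drop a hyphen that ends a line and glue on the start of the next line."""
    # (?<=\w) requires a word character just before the hyphen.
    return re.sub(r"(?<=\w)-[ \t]*\n[ \t]*", "", text)

def remove_all_hyphens(text):
    """Remove every hyphen, as described for the -s option."""
    return join_line_end_hyphens(text).replace("-", "")

print(join_line_end_hyphens("so much hyper-\nbole"))        # so much hyperbole
print(join_line_end_hyphens("take a six-\nweek vacation"))  # take a sixweek vacation
print(remove_all_hyphens("take a six-week vacation"))       # take a sixweek vacation
```

One possible correction, not implemented here, would be to check the joined word against a dictionary and keep the hyphen when the joined form is unknown.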

Apostrophes

My program considers embedded apostrophes as part of the word. So "don't" is one word, not the two words "don" and "t". But like the hyphen and digits, there are complications. Here are some examples of irritating apostrophe use:

From the copyright page of "Surely You're Joking, Mr. Feynman!":

Portions of this book appeared in /Science '84/ magazine December

The '84 vanishes when the -d (include digits) option is not used.

From http://www.apple.com/hotnews/thoughts-on-flash/

I wanted to jot down some of our thoughts on Adobe’s Flash products ...

That's not an ASCII apostrophe in the Flash thoughts. It's a Unicode right single quotation mark (U+2019) encoded as UTF-8, and unless UTF-8 is handled in the word counting program, or unless all of those apostrophes are replaced with ASCII apostrophes, a large collection of single "s" words is counted. The main result there is that I can't just copy a block of text from a web page and give it to the program.
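A preprocessing step of the kind mentioned above might look like this in Python. It is a sketch of one workaround and not part of zwc; the function name normalize_apostrophes is invented here. It replaces only U+2019, the character seen in the Flash text.

```python
def normalize_apostrophes(text):
    """Replace the typographic apostrophe (U+2019) with the ASCII one."""
    return text.replace("\u2019", "'")

pasted = "some of our thoughts on Adobe\u2019s Flash products"
cleaned = normalize_apostrophes(pasted)
print(cleaned)
```

After this step "Adobe's" stays one word for any tokenizer that treats the ASCII apostrophe as a word character, instead of splitting into "adobe" and a stray "s".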

From Oliver Twist:

and a faint voice imperfectly articulated the words, 'Let me see the child, and die.'

In the US the double quote is used for quotations; in the UK the single quote is used more often, and the single quote is the same character as the apostrophe.

These things throw off the word count. In particular, the results for Oliver Twist are suspect since there are a large number of apostrophe-related artefacts that shouldn't meet anyone's definition of a unique word. For example:

    3, 	4269, 	'
   56, 	436, 	'i

  134, 	156, 	'and
  254, 	78, 	'but
 1341, 	12, 	'oliver

 4900, 	2, 	again'
 5435, 	2, 	fortun'
 6395, 	1, 	'no'
 6981, 	1, 	didn't'
 7514, 	1, 	'blessed'
 8051, 	1, 	'life-preserver'
 8177, 	1, 	'jemmies'
10246, 	1, 	'frob
10249, 	1, 	i'b
10307, 	1, 	''cod
10690, 	1, 	'hell's
11050, 	1, 	'wot's
11061, 	1, 	'he'
11067, 	1, 	'to-night's
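One way to reduce these artefacts, which zwc does not do, would be to strip apostrophes from the edges of each token while keeping the embedded ones. The Python sketch below uses a hypothetical function name, clean_tokens, and a few of the tokens from the list.

```python
def clean_tokens(tokens):
    """Strip leading and trailing apostrophes; drop tokens left empty."""
    stripped = (token.strip("'") for token in tokens)
    return [token for token in stripped if token]

artefacts = ["'", "'i", "'oliver", "again'", "didn't'", "''cod", "'hell's"]
print(clean_tokens(artefacts))
```

The trade-off is that deliberate elisions lose their marker as well: "fortun'" becomes "fortun", which is still not a word anyone would want counted.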

The average of the ratios and a search strategy

As words were read in I kept a sorted list based on the word frequency and used a linear search. My initial impression was that the probability of hitting the most used words was so high that it might be quicker to linearly search the first M words then switch to a binary search for the remainder. M is the Magic number, based on mass of the most used words, that would minimize search time.
Given that the ratio between adjacent ranks' word frequencies is small (eg, rank^-1.0085), I suspect that the best value for M is 1. Looking at the page sidebyside.html, if either strcmp() knew about Zipf's Law, or if it just so happened that there existed no seldom used words between the most used words, then a binary search would be especially fast.

It would also be odd if there really were a constant ratio between the word frequencies of adjacent ranks, and a large one at that. Suppose the most used word were used twice as often as the next, which was used twice as often as the next, etc. Then the average of the ratios would be 2. If the least used word is used only once, the next is used twice, etc., and all "bodies of natural language" with N distinct words would be 2^N-1 words long.
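The arithmetic behind that length can be written out. The symbols are introduced here for illustration only: N is the number of distinct words and f_k is the count of the word at rank k, assuming a ratio of exactly 2 between adjacent ranks and a count of 1 for the least used word.

```latex
f_k = 2^{\,N-k}, \quad k = 1, \dots, N
\qquad\Longrightarrow\qquad
\sum_{k=1}^{N} f_k \;=\; \sum_{j=0}^{N-1} 2^{j} \;=\; 2^{N} - 1
```

With only N = 20 distinct words the text would already have to be 1,048,575 words long, which is why a large constant ratio cannot describe real texts.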

WSLNW: The first other plot on these pages

The plots labeled WSLNW are Words Since Last New Word plots (an example is here), and each one simply counts just what it says it counts. It's a visual indication of the sustained rate of new word production. I did this because I thought it would be an informative metric. But I don't think it has been so.
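As a guess at what such a plot is built from, here is a Python sketch that records, for each word seen for the first time, how many words had gone by since the previous new word. The function name and the exact definition of the gap are my assumptions; the zwc program may count slightly differently.

```python
def words_since_last_new_word(words):
    """Return (position, gap) pairs, one per first occurrence of a word.

    position is the 1-based running word count; gap is the number of
    already-seen words read since the previous new word.
    """
    seen = set()
    last_new_position = 0
    points = []
    for position, word in enumerate(words, start=1):
        if word not in seen:
            seen.add(word)
            points.append((position, position - last_new_position - 1))
            last_new_position = position
    return points

sample = "the cat and the dog and the bird".split()
points = words_since_last_new_word(sample)
print(points)
```

Plotting the gap against the position would give a curve that creeps upward as the author's vocabulary for the text is used up.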

WRPROF: The second other plot on these pages

The plots labeled WRPROF are Word Repetition PRoFile plots (an example is here). Each one results from a list created, per word, with the current total word count each time the word is encountered. I was unsure as to what I would get out of this, but it is a visual indication of patterns of word repetition.
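The per-word list described here can be sketched in Python as a mapping from each word to the running word counts at which it occurs. The function name word_repetition_profile is hypothetical and this is an illustration of the idea, not the zwc data structure.

```python
from collections import defaultdict

def word_repetition_profile(words):
    """Map each word to the 1-based running word counts where it occurs."""
    positions = defaultdict(list)
    for position, word in enumerate(words, start=1):
        positions[word].append(position)
    return dict(positions)

sample = "the cat and the dog and the bird".split()
profile = word_repetition_profile(sample)
for word, where in profile.items():
    print(word, where)
```

A word that first appears late but then repeats often shows up as a list that starts at a large position and grows quickly from there.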
This proved to be more interesting than the WSLNW plot. Here are examples of things I find interesting from the WRPROF plots for 4 of the texts:

In 1musk12.txt.html there is a vertical strip at about word 17600. Some word appears first about 3/4 of the way through the text and becomes used frequently from then on. I found it, and this word, this character of Mr. Felton, was supposed to be kept out of the picture for quite some time. It's here:

Then, turning toward the door, and seeing that the young officer
was waiting for his last orders, he said.  "All is well, I thank
you; now leave us alone, Mr. Felton."

Then a new chapter begins. I didn't remember this character. I'll have to read The Three Musketeers again soon.

In callw10.txt.html there is a vertical strip at about word 20000. Some word appears first far into the text and becomes used frequently from then on. This one I guessed easily. It's "Thornton".
There is a Tcl GUI which comes with the zwc program. Its chief use is in making it easier to build the HTML summaries, run gnuplot, and zoom in on regions of interest. Having identified a region of interest at around word rank 20000, the screenshot below shows how the GUI is used to zoom in.
The X Min Value is set at 19600, the X Max Value is set at 20000. With the Add labels? set to No the Y Max Value entry is not read. Clicking on the WRPROF Replot button generates the image shown.

There actually seem to be two strips of sudden and frequent use of a new word there. Setting Add labels? to Yes and the Y Max Value to 20000, then clicking on the WRPROF Replot button again, generates the new plot shown.

The words are "Thornton's" and "Thornton".

In nostradamus.txt.html there are repeating diagonal lines ... groups of regularly reused words. Or digits maybe:

grep "^35" nostradamus.txt
35
35
35
35
35
35
35
35
35
35

The repeating diagonal lines are a result of the section numbers.

The same thing with the repeating digits should appear in the bible bible.txt.html, but I think it is masked by the volume of words. It's the largest text I used, comprising 1061230 words (with the -d option).
But there are other things to note. For example the features at about 447000 words here:

represent collections of words that start at some point, get a lot of use, then there is a drastic reduction in the rate of the word use. Just what one would expect from a collection of stories. In fact, the arrow points to the word "Psalms", and there are 150 lines of
"Psalms Chapter <chapter number>".

Conclusions to draw

Regarding the distribution of words in bodies of natural language, I have no conclusions at this point. Sometimes it's like that. Zipf got there before I did. To me it was striking to see such an effect. It still is.

Conclusions not to draw

One conclusion that should not be drawn is that there is any meaning to a WCvsRANK plot making a good fit with the line based on the calculated average of the ratios. The best fit I found is The Prophecies of Nostradamus. I don't think that means that the text of The Prophecies is meaningful.

Further work

This program is getting old and is likely to be abandoned and replaced with something that can take in a standard lexical corpus. There are too many things I want to change. Those are:

Switch to native UTF-8 character encoding.
No hardcoding of non-word tokens. Use a configuration file instead.
The configuration file will include parameters for:
- Determining start and end of a "word",
- Digit handling,
- Hyphen handling,
- Apostrophe handling.

The Code

If you wish to use the program, contact me, or download it: zwc.tar.gz.

¹42. Yes I know.
