Word counting in the name of Zipf

Many, many years ago I read about Zipf's Law and thought it was something I had to see for myself. I wrote a word sorting/counting program and put gnuplot to work. I've kept the program alive for those occasions where I think of some other text profiling idea I want to try.
Never before has counting words been so much fun.

Why

What is this "Law of Zipf"?

Wikipedia knows more than I do about Zipf's Law and is a good first place to go. However, speaking loosely, Zipf's law states that the frequency of occurrence of any word in a body of natural language is inversely proportional to its rank in the frequency table. Thus, if the most frequent word occurs approximately a times as often as the second most frequent word, that relationship holds true for the second to third, third to fourth, etc. It's a 1/(a^b) relationship where b tends to 1.

This type of distribution is found in a number of natural phenomena, such as earthquake magnitudes and moon crater sizes, and also in many unnatural phenomena, such as city sizes and corporation sizes. Usually plotted on a log-log graph, it shows a linear relationship between frequency and rank (ranked by size, count, value, etc.) of the data.

So one becomes curious

Why does this discrete power law distribution appear in spoken and written languages? The debate is ongoing. But first things first ... does this distribution actually occur? That second question is something I can check easily.

A preview example

So I wrote a program, wrapped it in a Tcl GUI, sorted some words, made some plots. I'll get to the details in the next section, but here is a preview and example of the kind of results one gets. For this example I used a topical text from http://www.apple.com/hotnews/thoughts-on-flash/

Statistics on the text file thoughtsonflash.txt, using the command:
zwc -s -a -d -l -o outfiles -f thoughtsonflash.txt
The total word count is 1698
The unique words number 569
The ratio unique/total is 33.5100 %
The highest ranking 55 words make up 50.236 % of the total, OR
The number of words used once 336
The ratio (words used once)/total is 19.7880 %
The average of the ratios between frequencies is 1.0085
# -----------------------------------------
#     	word 	the                                               
# rank	count 	word                                              
# -----------------------------------------
1, 	66, 	the
2, 	64, 	and
3, 	36, 	to
4, 	35, 	flash
5, 	29, 	is
6, 	29, 	of
7, 	27, 	for
8, 	26, 	on
9, 	23, 	we
10, 	23, 	in
11, 	21, 	a
12, 	21, 	that
13, 	20, 	adobe
14, 	17, 	has
15, 	17, 	they
16, 	16, 	apple
17, 	15, 	mobile
18, 	15, 	devices
19, 	14, 	our
20, 	14, 	are
21, 	14, 	it
22, 	13, 	developers
23, 	13, 	open
24, 	12, 	platform
25, 	12, 	web
26, 	11, 	not
27, 	11, 	their
28, 	10, 	many
29, 	10, 	all
30, 	10, 	video
31, 	9, 	third
32, 	8, 	available
33, 	8, 	party
34, 	8, 	have
35, 	8, 	by
36, 	8, 	apps
37, 	7, 	products
38, 	7, 	with
39, 	7, 	websites
40, 	7, 	an
41, 	7, 	from
42, 	7, 	when
43, 	7, 	but
44, 	7, 	touch
45, 	7, 	standards
46, 	7, 	html5
47, 	6, 	there
48, 	6, 	iphones
49, 	6, 	ipods
50, 	6, 	ipads
51, 	6, 	if
52, 	6, 	using
53, 	6, 	adobe's
54, 	6, 	been
55, 	6, 	this
56, 	6, 	want
57, 	6, 	any
58, 	6, 	apple's
59, 	6, 	will
60, 	6, 	more
61, 	5, 	uses
62, 	5, 	h
63, 	5, 	264
64, 	5, 	there's
65, 	5, 	play
66, 	5, 	be
67, 	5, 	iphone
68, 	5, 	ipad
69, 	5, 	enhancements
70, 	5, 	almost
71, 	5, 	adopted
72, 	5, 	adopt
73, 	5, 	even
74, 	5, 	best
75, 	5, 	was
76, 	5, 	create
77, 	5, 	new
78, 	4, 	browser
79, 	4, 	webkit
80, 	4, 	first
81, 	4, 	half
82, 	4, 	companies
83, 	4, 	than
84, 	4, 	other
85, 	4, 	use
86, 	4, 	proprietary
87, 	4, 	support
88, 	4, 	based
89, 	4, 	software
90, 	4, 	standard
91, 	4, 	only
92, 	4, 	cannot
93, 	4, 	ipod
94, 	4, 	example
95, 	4, 	years
96, 	4, 	now
97, 	4, 	most
98, 	4, 	can
99, 	4, 	app
100, 	4, 	games
101, 	4, 	era
102, 	4, 	pcs
103, 	4, 	too
104, 	3, 	system
105, 	3, 	security
106, 	3, 	2009
107, 	3, 	say
108, 	3, 	second
109, 	3, 	ship
110, 	3, 	used
111, 	3, 	google
112, 	3, 	youtube
113, 	3, 	battery
114, 	3, 	videos
115, 	3, 	without
116, 	3, 	modern
117, 	3, 	like
118, 	3, 	need
119, 	3, 	fact
120, 	3, 	closed
121, 	3, 	reason
122, 	3, 	also
123, 	3, 	experience
124, 	3, 	one
125, 	3, 	development
126, 	3, 	cross
127, 	3, 	platforms
128, 	3, 	two
129, 	3, 	fully
130, 	3, 	mac
131, 	3, 	them
132, 	3, 	because
133, 	3, 	users
134, 	3, 	why
135, 	3, 	power
136, 	3, 	content
137, 	3, 	store
138, 	3, 	created
139, 	3, 	as
140, 	3, 	2010
141, 	2, 	were
142, 	2, 	together
143, 	2, 	around
144, 	2, 	creative
145, 	2, 	joint
146, 	2, 	customers
147, 	2, 	since
148, 	2, 	controlled
149, 	2, 	source
150, 	2, 	widely
151, 	2, 	its
152, 	2, 	technology
153, 	2, 	full
154, 	2, 	times
155, 	2, 	others
156, 	2, 	true
157, 	2, 	entertainment
158, 	2, 	titles
159, 	2, 	performance
160, 	2, 	these
161, 	2, 	don't
162, 	2, 	reliability
163, 	2, 	well
164, 	2, 	device
165, 	2, 	few
166, 	2, 	said
167, 	2, 	smartphone
168, 	2, 	then
169, 	2, 	long
170, 	2, 	life
171, 	2, 	much
172, 	2, 	every
173, 	2, 	vimeo
174, 	2, 	netflix
175, 	2, 	recently
176, 	2, 	decoder
177, 	2, 	chips
178, 	2, 	must
179, 	2, 	while
180, 	2, 	hours
181, 	2, 	browsers
182, 	2, 	safari
183, 	2, 	which
184, 	2, 	up
185, 	2, 	over
186, 	2, 	mouse
187, 	2, 	css
188, 	2, 	javascript
189, 	2, 	would
190, 	2, 	rewritten
191, 	2, 	doesn't
192, 	2, 	important
193, 	2, 	do
194, 	2, 	allow
195, 	2, 	run
196, 	2, 	know
197, 	2, 	enhancement
198, 	2, 	at
199, 	2, 	may
200, 	2, 	access
201, 	2, 	set
202, 	2, 	features
203, 	2, 	tool
204, 	2, 	goal
205, 	2, 	help
206, 	2, 	write
207, 	2, 	although
208, 	2, 	10
209, 	2, 	major
210, 	2, 	developer
211, 	2, 	os
212, 	2, 	x
213, 	2, 	advanced
214, 	2, 	world
215, 	2, 	ever
216, 	2, 	seen
217, 	2, 	so
218, 	2, 	wider
219, 	2, 	customer
220, 	2, 	continually
221, 	2, 	mice
222, 	2, 	business
223, 	2, 	understand
224, 	2, 	beyond
225, 	2, 	low
226, 	2, 	where
227, 	2, 	offering
228, 	2, 	no
229, 	2, 	or
230, 	2, 	000
231, 	2, 	necessary
232, 	2, 	applications
233, 	2, 	perhaps
234, 	2, 	should
235, 	2, 	great
236, 	2, 	tools
237, 	2, 	future
238, 	2, 	less
239, 	1, 	relationship
240, 	1, 	met
241, 	1, 	founders
242, 	1, 	proverbial
243, 	1, 	garage
244, 	1, 	big
245, 	1, 	adopting
246, 	1, 	postscript
247, 	1, 	language
248, 	1, 	laserwriter
249, 	1, 	printer
250, 	1, 	invested
251, 	1, 	owned
252, 	1, 	20
253, 	1, 	company
254, 	1, 	worked
255, 	1, 	closely
256, 	1, 	pioneer
257, 	1, 	desktop
258, 	1, 	publishing
259, 	1, 	good
260, 	1, 	golden
261, 	1, 	grown
262, 	1, 	apart
263, 	1, 	went
264, 	1, 	through
265, 	1, 	near
266, 	1, 	death
267, 	1, 	drawn
268, 	1, 	corporate
269, 	1, 	market
270, 	1, 	acrobat
271, 	1, 	today
272, 	1, 	still
273, 	1, 	work
274, 	1, 	serve
275, 	1, 	buy
276, 	1, 	suite
277, 	1, 	interests
278, 	1, 	i
279, 	1, 	wanted
280, 	1, 	jot
281, 	1, 	down
282, 	1, 	some
283, 	1, 	thoughts
284, 	1, 	critics
285, 	1, 	better
286, 	1, 	characterized
287, 	1, 	decision
288, 	1, 	being
289, 	1, 	primarily
290, 	1, 	driven
291, 	1, 	protect
292, 	1, 	reality
293, 	1, 	issues
294, 	1, 	claims
295, 	1, 	opposite
296, 	1, 	let
297, 	1, 	me
298, 	1, 	explain
299, 	1, 	100
300, 	1, 	sole
301, 	1, 	authority
302, 	1, 	pricing
303, 	1, 	etc
304, 	1, 	does
305, 	1, 	mean
306, 	1, 	entirely
307, 	1, 	definition
308, 	1, 	though
309, 	1, 	operating
310, 	1, 	strongly
311, 	1, 	believe
312, 	1, 	pertaining
313, 	1, 	rather
314, 	1, 	high
315, 	1, 	implementations
316, 	1, 	lets
317, 	1, 	graphics
318, 	1, 	typography
319, 	1, 	animations
320, 	1, 	transitions
321, 	1, 	relying
322, 	1, 	plug
323, 	1, 	ins
324, 	1, 	completely
325, 	1, 	committee
326, 	1, 	member
327, 	1, 	creates
328, 	1, 	began
329, 	1, 	small
330, 	1, 	project
331, 	1, 	complete
332, 	1, 	rendering
333, 	1, 	engine
334, 	1, 	heart
335, 	1, 	android's
336, 	1, 	palm
337, 	1, 	nokia
338, 	1, 	rim
339, 	1, 	blackberry
340, 	1, 	announced
341, 	1, 	microsoft's
342, 	1, 	making
343, 	1, 	repeatedly
344, 	1, 	75
345, 	1, 	what
346, 	1, 	format
347, 	1, 	viewable
348, 	1, 	estimated
349, 	1, 	40
350, 	1, 	web's
351, 	1, 	shines
352, 	1, 	bundled
353, 	1, 	discovery
354, 	1, 	viewing
355, 	1, 	add
356, 	1, 	facebook
357, 	1, 	abc
358, 	1, 	cbs
359, 	1, 	cnn
360, 	1, 	msnbc
361, 	1, 	fox
362, 	1, 	news
363, 	1, 	espn
364, 	1, 	npr
365, 	1, 	time
366, 	1, 	york
367, 	1, 	wall
368, 	1, 	street
369, 	1, 	journal
370, 	1, 	sports
371, 	1, 	illustrated
372, 	1, 	people
373, 	1, 	national
374, 	1, 	geographic
375, 	1, 	aren't
376, 	1, 	missing
377, 	1, 	another
378, 	1, 	claim
379, 	1, 	fortunately
380, 	1, 	50
381, 	1, 	free
382, 	1, 	symantec
383, 	1, 	highlighted
384, 	1, 	having
385, 	1, 	worst
386, 	1, 	records
387, 	1, 	hand
388, 	1, 	number
389, 	1, 	macs
390, 	1, 	crash
391, 	1, 	working
392, 	1, 	fix
393, 	1, 	problems
394, 	1, 	persisted
395, 	1, 	several
396, 	1, 	reduce
397, 	1, 	adding
398, 	1, 	addition
399, 	1, 	performed
400, 	1, 	routinely
401, 	1, 	asked
402, 	1, 	show
403, 	1, 	us
404, 	1, 	performing
405, 	1, 	never
406, 	1, 	publicly
407, 	1, 	early
408, 	1, 	think
409, 	1, 	eventually
410, 	1, 	we're
411, 	1, 	glad
412, 	1, 	didn't
413, 	1, 	hold
414, 	1, 	breath
415, 	1, 	who
416, 	1, 	knows
417, 	1, 	how
418, 	1, 	perform
419, 	1, 	fourth
420, 	1, 	achieve
421, 	1, 	playing
422, 	1, 	decode
423, 	1, 	hardware
424, 	1, 	decoding
425, 	1, 	contain
426, 	1, 	called
427, 	1, 	industry
428, 	1, 	blu
429, 	1, 	ray
430, 	1, 	dvd
431, 	1, 	player
432, 	1, 	added
433, 	1, 	currently
434, 	1, 	requires
435, 	1, 	older
436, 	1, 	generation
437, 	1, 	implemented
438, 	1, 	difference
439, 	1, 	striking
440, 	1, 	decoded
441, 	1, 	5
442, 	1, 	before
443, 	1, 	drained
444, 	1, 	re
445, 	1, 	encode
446, 	1, 	offer
447, 	1, 	perfectly
448, 	1, 	google's
449, 	1, 	chrome
450, 	1, 	plugins
451, 	1, 	whatsoever
452, 	1, 	look
453, 	1, 	fifth
454, 	1, 	designed
455, 	1, 	screens
456, 	1, 	fingers
457, 	1, 	rely
458, 	1, 	rollovers
459, 	1, 	pop
460, 	1, 	menus
461, 	1, 	elements
462, 	1, 	arrow
463, 	1, 	hovers
464, 	1, 	specific
465, 	1, 	spot
466, 	1, 	revolutionary
467, 	1, 	multi
468, 	1, 	interface
469, 	1, 	concept
470, 	1, 	rollover
471, 	1, 	rewrite
472, 	1, 	technologies
473, 	1, 	ran
474, 	1, 	solve
475, 	1, 	problem
476, 	1, 	sixth
477, 	1, 	besides
478, 	1, 	technical
479, 	1, 	drawbacks
480, 	1, 	discussed
481, 	1, 	downsides
482, 	1, 	interactive
483, 	1, 	wants
484, 	1, 	painful
485, 	1, 	letting
486, 	1, 	layer
487, 	1, 	come
488, 	1, 	between
489, 	1, 	ultimately
490, 	1, 	results
491, 	1, 	sub
492, 	1, 	hinders
493, 	1, 	progress
494, 	1, 	grow
495, 	1, 	dependent
496, 	1, 	libraries
497, 	1, 	take
498, 	1, 	advantage
499, 	1, 	chooses
500, 	1, 	mercy
501, 	1, 	deciding
502, 	1, 	make
503, 	1, 	becomes
504, 	1, 	worse
505, 	1, 	supplying
506, 	1, 	unless
507, 	1, 	supported
508, 	1, 	hence
509, 	1, 	lowest
510, 	1, 	common
511, 	1, 	denominator
512, 	1, 	again
513, 	1, 	accept
514, 	1, 	outcome
515, 	1, 	blocked
516, 	1, 	innovations
517, 	1, 	competitor's
518, 	1, 	painfully
519, 	1, 	slow
520, 	1, 	shipping
521, 	1, 	just
522, 	1, 	cocoa
523, 	1, 	weeks
524, 	1, 	ago
525, 	1, 	shipped
526, 	1, 	cs5
527, 	1, 	last
528, 	1, 	motivation
529, 	1, 	simple
530, 	1, 	provide
531, 	1, 	innovative
532, 	1, 	stand
533, 	1, 	directly
534, 	1, 	shoulders
535, 	1, 	enhance
536, 	1, 	amazing
537, 	1, 	powerful
538, 	1, 	fun
539, 	1, 	useful
540, 	1, 	everyone
541, 	1, 	wins
542, 	1, 	sell
543, 	1, 	reach
544, 	1, 	audience
545, 	1, 	base
546, 	1, 	delighted
547, 	1, 	broadest
548, 	1, 	selection
549, 	1, 	conclusions
550, 	1, 	during
551, 	1, 	pc
552, 	1, 	successful
553, 	1, 	push
554, 	1, 	about
555, 	1, 	interfaces
556, 	1, 	areas
557, 	1, 	falls
558, 	1, 	short
559, 	1, 	avalanche
560, 	1, 	media
561, 	1, 	outlets
562, 	1, 	demonstrates
563, 	1, 	longer
564, 	1, 	watch
565, 	1, 	consume
566, 	1, 	kind
567, 	1, 	200
568, 	1, 	proves
569, 	1, 	isn't
570, 	1, 	tens
571, 	1, 	thousands
572, 	1, 	graphically
573, 	1, 	rich
574, 	1, 	including
575, 	1, 	such
576, 	1, 	win
577, 	1, 	focus
578, 	1, 	creating
579, 	1, 	criticizing
580, 	1, 	leaving
581, 	1, 	past
582, 	1, 	behind
583, 	1, 	steve
584, 	1, 	jobs
585, 	1, 	april

You may notice that the frequency/rank relationship isn't as clearly visible as what may have been suggested by the loose description above. Here are the ratios of the word frequencies between adjacent ranks for the first several and final several entries:

1, 	66, 	the      66/64 =   1.031
2, 	64, 	and      64/36 =   1.778
3, 	36, 	to       36/35 =   1.029
4, 	35, 	flash    35/29 =   1.207
5, 	29, 	is       29/29 =   1.000
6, 	29, 	of       29/27 =   1.074
7, 	27, 	for      27/26 =   1.038
8, 	26, 	on
...
580, 	1, 	leaving   1/1  =   1.000
581, 	1, 	past      1/1  =   1.000
582, 	1, 	behind    1/1  =   1.000
583, 	1, 	steve     1/1  =   1.000
584, 	1, 	jobs      1/1  =   1.000
585, 	1, 	april
Averaging these ratios yields 1.0085, and the graph form above is shown again here along with the curve Ymax*rank^(-1.0085) (in red). That's the line for a constant ratio between adjacent ranks' word frequencies.

[Here I chose Ymax =220. In graphs that follow the value of Ymax will be the same as the word frequency for rank 1. The value of 220 here is somewhat arbitrary and chosen only for aesthetic reasons. Curve fitting is less fitting when the number of total words to count is small.]
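
The ratio column above is easy to reproduce. Here is a minimal Python sketch (not the zwc code itself) that computes the adjacent-rank ratios for the first eight counts in the table, plus a helper for the reference curve. The names adjacent_ratios and curve are mine, chosen for illustration.

```python
# Counts for ranks 1..8, copied from the table above.
counts = [66, 64, 36, 35, 29, 29, 27, 26]

def adjacent_ratios(counts):
    """Ratio of each rank's count to the count of the next rank down."""
    return [a / b for a, b in zip(counts, counts[1:])]

def curve(rank, ymax, exponent):
    """Reference curve ymax * rank^(-exponent), as drawn on the plot."""
    return ymax * rank ** (-exponent)

ratios = adjacent_ratios(counts)
for rank, ratio in enumerate(ratios, start=1):
    print(f"{rank:3d}  {ratio:.5f}")

# Mean of these 7 ratios only; the 1.0085 quoted above is the mean
# over the whole list, which this short sample does not reproduce.
mean_ratio = sum(ratios) / len(ratios)
```

Calling curve(rank, 220, 1.0085) for each rank gives the points of the red line described above.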




What

So that was the "Why?" and I seldom need more than curiosity to look into something like this. In this section I will present more of the results, calling attention to some particular features and failures.

Some results

I chose 14 texts and ran the word count program and associated scripts on them. These texts were mostly from Project Gutenberg. The texts were stripped of introductions, editor's notes, historic background, etc., since I was primarily interested in finding out if different authors or different types of texts have a "signature" that is apparent from simple metrics like those shown at the top of the thoughts-on-flash example:
  • The ratio (unique words)/total ("unique words" is the same as the number of ranks).
  • The number of words making up 50% of the total volume of words.
  • The ratio (words used once)/total.
  • The average of the ratios between frequencies.
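
The four metrics in the list above could be computed along these lines. This is a Python sketch under my own assumptions, not the zwc implementation: the input is an already tokenized, non-empty list of lowercase words, and "words making up 50%" is taken to mean the smallest number of top-ranked words whose counts reach at least half the total. The function name summary and the dictionary keys are hypothetical.

```python
from collections import Counter

def summary(words):
    """Return the four summary metrics for a non-empty list of words."""
    freq = Counter(words)
    total = sum(freq.values())
    # Counts ordered by rank, rank 1 (most frequent) first.
    counts = sorted(freq.values(), reverse=True)

    # Smallest number of top-ranked words covering at least half the total.
    running = 0
    words_for_half = 0
    for c in counts:
        running += c
        words_for_half += 1
        if 2 * running >= total:
            break

    used_once = sum(1 for c in counts if c == 1)
    ratios = [a / b for a, b in zip(counts, counts[1:])]
    mean_ratio = sum(ratios) / len(ratios) if ratios else float("nan")

    return {
        "unique_over_total": len(counts) / total,
        "words_for_half": words_for_half,
        "once_over_total": used_once / total,
        "mean_adjacent_ratio": mean_ratio,
    }

stats = summary("the cat and the dog and the bird".split())
```

The order of words with equal counts does not affect any of the four numbers, so no tie-breaking rule is needed here.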

    Raw results

    This page provides some raw results as a collection of automatically generated HTML pages, with notes added. The 14 texts used were:
  • 1musk12 The Three Musketeers by Alexandre Dumas.
  • AstroText Astro-Diagnosis A Guide To Healing by Max Heindel and Augusta Foss Heindel.
  • asyoulikeit As You Like It by William Shakespeare.
  • bible Holy Bible King James Version.
  • callw10 The Call of the Wild by Jack London.
  • cbook Thinking in C++, 2nd ed. Volume 1 by Bruce Eckel.
  • extraordinary Memoirs of Extraordinary Popular Delusions and the Madness of Crowds by Charles Mackay.
  • koran The Koran
  • MrFeynman "Surely You're Joking, Mr. Feynman!" by Richard P. Feynman.
  • nostradamus The Prophecies by Nostradamus.
  • olivertwist Oliver Twist by Charles Dickens.
  • origins The Origin of Species by means of Natural Selection, 6th Edition by Charles Darwin.
  • ovm.txt OVM User Guide, Version 2.1
  • rime.txt The Rime of the Ancient Mariner by Samuel Taylor Coleridge.

    The top 40ish words

    The page sidebyside.html shows a side-by-side comparison of the top ranking 42 words from each of the texts. It mostly consists of function words as expected, though there are particularly telling content words also in that top 42 list. I've put some of them in bold.

    [The number 42 has no particular relevance [1]. It's just where I cut the table length.]

    Discussion and the other plots on these pages

    What is a "word"?

    To a first approximation, a "word" is everything that is separated by non-word tokens, where a non-word token is any of whitespace, comma, period, quote, semicolon, colon, ... the things that normally separate words. It's not that simple though.
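
As a first approximation in code, here is a small Python sketch of that idea. It is not how zwc does it. I am assuming ASCII text, lowercasing everything, and treating any character that is not a letter or an embedded apostrophe as a separator. The names WORD and tokenize are made up for this example.

```python
import re

# Assumption: a word is a run of letters, optionally with embedded
# apostrophes (as in "don't"). Everything else separates words.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Lowercase the text and return its words in order of appearance."""
    return WORD.findall(text.lower())

words = tokenize("Don't stop, Oliver; 'tis late.")
```

Note that the leading apostrophe of 'tis is dropped, since only embedded apostrophes count as part of a word in this sketch.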

    Are digits "words"?

    Some of the texts have a large volume of digits. All kinds of digits. The verses from the bible are numbered. Some texts include page numbers. In those cases, are digits to be included? I think not.
    But always excluding digits is also problematic, since it turns the H.264 video specification in http://www.apple.com/hotnews/thoughts-on-flash/ into 5 instances of "h" (I lowercase all words), 100% becomes a null string, and when Richard P. Feynman says:
    Second, when you have a gear ratio, say 2 to 1, and you are
    wondering whether you should make it 10 to 5 or 24 to 12 or
    48 to 24, here's how to decide:
    
    then stripping digits translates to:
    Second, when you have a gear ratio, say to, and you are
    wondering whether you should make it to or to or
    to, here's how to decide:
    
    which is nonsense.

    My wordcount program has only a single digit handling option, -d, which puts digits on par with a-z,A-Z. Depending on the text, switching the digits option in and out can have a large effect on the total word count, which changes all the ratios that depend on total count.
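
The effect of such a switch can be illustrated with a few lines of Python. This is only a sketch of the idea, not the zwc code: the include_digits parameter stands in for the -d option, and the tokenizing rule (letters, optionally digits, and embedded apostrophes) is my assumption.

```python
import re

def tokenize(text, include_digits=False):
    """Split text into lowercase words; digits count only if requested."""
    chars = "a-z0-9" if include_digits else "a-z"
    pattern = rf"[{chars}]+(?:'[{chars}]+)*"
    return re.findall(pattern, text.lower())

sentence = "say 2 to 1, and H.264 at 100%"
without_digits = tokenize(sentence)
with_digits = tokenize(sentence, include_digits=True)
```

Without digits the sentence shrinks to five words and "H.264" becomes a lone "h". With digits it is nine words, and "h" and "264" are still separate because the period splits them.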

    Hyphens and Hy-
    phens

    The program includes a hyphen handling option, -s. When a hyphen appears at the end of a line it could be that the word is being split by someone or some program in order to fit the line length.
    The program always assumes this to be the case when a hyphen ends a line, and that hyphen is removed. The word becomes the portion ahead of the hyphen joined to everything up to the first non-word token of the next line. This means line 1 below is 4 words and line 2 is also 4 words, but one of the words in line 2 is now spelt wrong (sixweek). Sometimes it works out though. Lines 3 and 4 are equivalent.
    1)
    take a six-week vacation
    2)
    take a six-
    week vacation
    3)
    so much hyper-
    bole
    4)
    so much hyperbole
    

    With the -s option selected all hyphens are removed and lines 1 and 2 are now the same, and both wrong. Some other method could be employed to correct this problem.
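
The line-end rule can be sketched in Python as a preprocessing step. This is an illustration under my own assumptions (a hyphen directly after a letter, followed by optional spaces and a newline, is a line-break hyphen), not the zwc code. The function name is hypothetical.

```python
import re

# A hyphen right after a letter, then optional blanks, a newline,
# and any indentation on the next line.
LINE_END_HYPHEN = re.compile(r"(?<=[A-Za-z])-[ \t]*\r?\n[ \t]*")

def join_line_end_hyphens(text):
    """Remove line-ending hyphens and join the two halves of the word."""
    return LINE_END_HYPHEN.sub("", text)

good = join_line_end_hyphens("so much hyper-\nbole")
bad = join_line_end_hyphens("take a six-\nweek vacation")
```

The first call recovers "hyperbole", while the second produces "sixweek", which is the misspelling described above. Telling the two cases apart would need something like a dictionary lookup, which this sketch does not attempt.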

    Apostrophes

    My program considers embedded apostrophes as part of the word. So "don't" is one word, not the two words "don" and "t". But like the hyphen and digits, there are complications. Here are some examples of irritating apostrophe use:

    From the copyright page of "Surely You're Joking, Mr. Feynman!":

    Portions of this book appeared in /Science '84/ magazine December
    
    The '84 vanishes when the -d (include digits) option is not used.

    From http://www.apple.com/hotnews/thoughts-on-flash/

    I wanted to jot down some of our thoughts on Adobe’s Flash products ...
    
    That's not an ASCII apostrophe in the Flash thoughts. It's a UTF-8 encoded apostrophe (the right single quotation mark, U+2019), and unless UTF-8 is handled in the word counting program or all of those apostrophes are replaced with ASCII apostrophes, a large collection of single "s" words is counted. The main result there is that I can't just copy a block of text from a web page and give it to the program.
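
Replacing those apostrophes before counting is a one-liner in a language with Unicode strings. A Python sketch, assuming the text has already been decoded from UTF-8 (the function name is made up for this example):

```python
def normalize_apostrophes(text):
    """Replace the right single quotation mark (U+2019) with ASCII '."""
    return text.replace("\u2019", "'")

fixed = normalize_apostrophes("thoughts on Adobe\u2019s Flash products")
```

After this step "Adobe's" stays one word instead of splitting into "adobe" and "s".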

    From Oliver Twist:

    and a faint voice imperfectly articulated the words, 'Let me see the child, and die.'
    
    In the US the double quote is used to quote; in the UK the single quote is used more often. The single quote is the same character as the apostrophe.

    These things throw off the word count. In particular, the results for Oliver Twist are suspect since there are a large number of apostrophe related artefacts that shouldn't meet anyone's definition of a unique word. For example:

        3, 	4269, 	'
       56, 	436, 	'i
    
      134, 	156, 	'and
      254, 	78, 	'but
     1341, 	12, 	'oliver
    
     4900, 	2, 	again'
     5435, 	2, 	fortun'
     6395, 	1, 	'no'
     6981, 	1, 	didn't'
     7514, 	1, 	'blessed'
     8051, 	1, 	'life-preserver'
     8177, 	1, 	'jemmies'
    10246, 	1, 	'frob
    10249, 	1, 	i'b
    10307, 	1, 	''cod
    10690, 	1, 	'hell's
    11050, 	1, 	'wot's
    11061, 	1, 	'he'
    11067, 	1, 	'to-night's
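
One possible cleanup, which zwc does not do, is to strip apostrophes from the start and end of each token and keep only the embedded ones. A Python sketch of that idea, using a few of the tokens listed above as input (the function name is hypothetical):

```python
def strip_quote_artifacts(tokens):
    """Strip leading/trailing apostrophes; drop tokens that become empty."""
    stripped = (t.strip("'") for t in tokens)
    return [t for t in stripped if t]

sample = ["'", "'i", "'oliver", "again'", "didn't'", "''cod", "'hell's", "fortun'"]
cleaned = strip_quote_artifacts(sample)
```

This is not free of side effects: a dialect elision such as fortun' loses its apostrophe as well, so it is a trade-off rather than a fix.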
    

    The average of the ratios and a search strategy

    As words were read in I kept a sorted list based on the word frequency and used a linear search. My initial impression was that the probability of hitting the most used words was so high that it might be quicker to linearly search the first M words then switch to a binary search for the remainder. M is the Magic number, based on the mass of the most used words, that would minimize search time.
    Given that the ratio between adjacent ranks' word frequencies is small (e.g., rank^(-1.0085)), I suspect that the best value for M is 1. Looking at the page sidebyside.html, if either strcmp() knew about Zipf's Law, or if it just so happened that there existed no seldom used words between the most used words, then a binary search would be especially fast.

    It would also be odd if there really were a constant relationship between the word frequencies of adjacent ranks and it were a large one. If, say, the most used word were used twice as often as the next, which was used twice as often as the next, etc., then the average of the ratios would be 2. This would imply that the least used word is used only once, the next twice, the next four times, etc., and all "bodies of natural language" would be 2^N - 1 words long, where N is the number of unique words.
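
The length figure is the sum 1 + 2 + 4 + ... over all ranks, which is 2 to the power N, minus 1. Here I take N to be the number of unique words, which is my reading of the argument. A quick Python check of that arithmetic (the function name is made up):

```python
def total_words(n_unique, ratio=2):
    """Total length if the least used word occurs once and each rank
    above it occurs `ratio` times as often as the rank below it."""
    return sum(ratio ** k for k in range(n_unique))

lengths = [total_words(n) for n in range(1, 6)]
```

For N from 1 to 5 this gives 1, 3, 7, 15, 31, so a text with only 30 unique words would already need more than a billion words in total.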

    WSLNW: The first other plot on these pages

    The plots labeled WSLNW are Words Since Last New Word plots (an example is here). Each one simply counts just what it says it counts. It's a visual indication of the sustained rate of new word production. I did this because I thought it would be an informative metric, but I don't think it has been so.
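
For the curious, one way to compute such a series is sketched below in Python. The exact definition used by zwc may differ. Here I assume that each time a never-before-seen word arrives, the number of already-seen words read since the previous new word is recorded. The function name is hypothetical.

```python
def words_since_last_new_word(words):
    """For each new word, the count of repeated words read since the
    previous new word appeared."""
    seen = set()
    gaps = []
    since_last = 0
    for w in words:
        if w in seen:
            since_last += 1
        else:
            seen.add(w)
            gaps.append(since_last)
            since_last = 0
    return gaps

gaps = words_since_last_new_word("a b a a c".split())
```

Plotting the gaps against their index gives one point per unique word, and long gaps mark stretches where the text reuses old vocabulary.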

    WRPROF: The second other plot on these pages

    The plots labeled WRPROF are Word Repetition PRoFile plots (an example is here). Each results from a list created, per word, of the current total word count each time the word is encountered. I was unsure as to what I would get out of this, but it is a visual indication of patterns of word repetition.
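
The per-word lists can be sketched in Python as follows. This is an illustration of the idea, not the zwc code, and the function name is made up. Positions are 1-based running word counts.

```python
from collections import defaultdict

def repetition_profile(words):
    """Map each word to the running word counts at which it occurs."""
    positions = defaultdict(list)
    for count, w in enumerate(words, start=1):
        positions[w].append(count)
    return dict(positions)

profile = repetition_profile("a b a c a".split())
```

One way to draw this is a point per occurrence, with the running word count on one axis and the word on the other. A word that first appears late and is then used heavily shows up as a dense run of points starting far into the text.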
    The WRPROF plot proved to be more interesting than the WSLNW plot. Here are examples of things I find interesting from the WRPROF plots for 4 of the texts:

    In 1musk12.txt.html there is a vertical strip at about word 17600. Some word appears first about 3/4 of the way through the text and becomes used frequently from then on. I found it, and this word, this character of Mr. Felton, was supposed to be kept out of the picture for quite some time. It's here:

    Then, turning toward the door, and seeing that the young officer
    was waiting for his last orders, he said.  "All is well, I thank
    you; now leave us alone, Mr. Felton."
    
    Then a new chapter begins. I didn't remember this character. I'll have to read The Three Musketeers again soon.

    In callw10.txt.html there is a vertical strip at about word 20000. Some word appears first far into the text and becomes used frequently from then on. This one I guessed easily. It's "Thornton".
    There is a Tcl GUI which comes with the zwc program. Its chief use is in making it easier to build the HTML summaries, run gnuplot, and zoom in on regions of interest. Having identified a region of interest at around word rank 20000, the screenshot below shows how the GUI is used to zoom in.
    The X Min Value is set at 19600, the X Max Value is set at 20000. With the Add labels? set to No the Y Max Value entry is not read. Clicking on the WRPROF Replot button generates the image shown.


    There actually seem to be two strips of sudden and frequent use of a new word there. Setting Add labels? to Yes and the Y Max Value to 20000, then clicking on the WRPROF Replot button again, generates the new plot shown.


    The words are "Thornton's" and "Thornton".

    In nostradamus.txt.html there are repeating diagonal lines ... groups of regularly reused words. Or digits maybe:

    grep "^35" nostradamus.txt
    35
    35
    35
    35
    35
    35
    35
    35
    35
    35
    
    The repeating diagonal lines are a result of the section numbers.

    The same thing with the repeating digits should appear in the bible bible.txt.html, but I think it is masked by the volume of words. It's the largest text I used, comprising 1061230 words (with the -d option).
    But there are other things to note. For example the features at about 447000 words here:

    represent collections of words that start at some point, get a lot of use, then there is a drastic reduction in the rate of the word use. Just what one would expect from a collection of stories. In fact, the arrow points to the word "Psalms", and there are 150 lines of
    "Psalms Chapter <chapter number>".

    Conclusions to draw

    Regarding the distribution of words in bodies of natural language, I have no conclusions at this point. Sometimes it's like that. Zipf got there before I did. To me it was striking to see such an effect. It still is.

    Conclusions not to draw

    One conclusion that should not be drawn is that there is any meaning to a WCvsRANK plot making a good fit with the line based on the calculated average of the ratios. The best fit I found is The Prophecies of Nostradamus. I don't think that means that the text of The Prophecies is meaningful.



    Further work

    This program is getting old and is likely to be abandoned and replaced with something that can take in a standard lexical corpus. There are too many things I want to change.



    The Code

    If you wish to use the program, contact me, or download it: zwc.tar.gz.
    [1] 42. Yes I know.