Frequency Distribution Calculator
Frequency Distribution Calculator is a tool that helps you calculate
and analyze word and character frequency distributions in text. You
can use it to calculate word rank, word count, character count,
and letter count, or to find and analyze Zipfian distributions
(text that follows Zipf's law). You can even use it to calculate the
entropy of text or to locate hapax legomena in text.
Frequency Distribution Calculator has many features, including:
Save and Export
Easily save the results of your analysis to a CSV file.
Analyze Zipfian Distributions
Determine whether or not a given selection of text follows Zipf's law.
Estimate the entropy of a given selection of text using
Shannon's entropy calculations.
Hapax Legomena Locator
Locate hapax legomena, which are words that appear only once in a
given selection of text.
Count the number of words in text. Analyze the rank and
frequency of words in text.
Count the number of characters in text. Analyze the rank and
frequency of letters and characters in text.
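The rank, frequency, hapax legomena, and entropy calculations described above can be sketched in a few lines of Python. This is a minimal illustration and not the calculator's actual implementation: the function names are hypothetical, a "word" is assumed to be a case-folded run of letters or apostrophes, and entropy is assumed to be measured over single characters.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Assumption: a "word" is a run of letters or apostrophes, case-folded.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def hapax_legomena(text):
    """Return the words that occur exactly once, in alphabetical order."""
    return sorted(word for _, word, count in word_frequencies(text) if count == 1)


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


sample = "the cat and the dog and the bird"
for rank, word, count in word_frequencies(sample):
    print(rank, word, count)
print(hapax_legomena(sample))
print(shannon_entropy("aabb"))
```

Tied words share a count but still receive consecutive ranks here; a different tie-breaking rule would be an equally reasonable choice.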
Zipf's law is an empirical law, formulated using mathematical statistics, which states that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power-law probability distributions. The Zipf distribution is related to the zeta distribution but is not identical to it.
Zipf's law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
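As a quick arithmetic check, the three Brown Corpus counts quoted above can be compared with what an ideal inverse-rank relation would predict. The sketch below assumes the simplest form of the law, f(r) = f(1) / r (an exponent of exactly 1); the helper name `zipf_prediction` is hypothetical.

```python
# Counts quoted in the text for the three most frequent Brown Corpus words.
observed = [("the", 69_971), ("of", 36_411), ("and", 28_852)]


def zipf_prediction(top_count, rank):
    """Ideal Zipf count at a given rank, assuming f(r) = f(1) / r."""
    return top_count / rank


top_count = observed[0][1]
ratios = []
for rank, (word, count) in enumerate(observed, start=1):
    predicted = zipf_prediction(top_count, rank)
    ratios.append(count / predicted)  # 1.0 would be a perfect fit
    print(f"rank {rank}: {word!r} observed={count} predicted={predicted:.1f}")
```

The observed counts for ranks 2 and 3 come out somewhat higher than the ideal prediction, which is consistent with the law being an approximation and not an exact rule.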
The same relationship occurs in many other rankings of human-created systems, such as the ranks of mathematical expressions or ranks of notes in music, and even in uncontrolled environments, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of the number of people watching the same TV channel, and so on.
Although Zipf's law holds for most languages, even non-natural ones like Esperanto, the reason is still not well understood. However, it may be partially explained by the statistical analysis of randomly generated texts. Wentian Li has shown that in a document in which each character has been chosen randomly from a uniform distribution of all letters (plus a space character), the "words" of different lengths follow the macro-trend of Zipf's law (the most probable words are the shortest, and words of equal length are equally probable). Vitold Belevitch, in the paper "On the Statistical Laws of Linguistic Distribution", offered a mathematical derivation. He took a large class of well-behaved statistical distributions (not only the normal distribution) and expressed them in terms of rank. He then expanded each expression into a Taylor series. In every case Belevitch obtained the remarkable result that a first-order truncation of the series resulted in Zipf's law. Further, a second-order truncation of the Taylor series resulted in Mandelbrot's law.
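The random-text experiment attributed to Wentian Li above can be imitated with a small simulation. This is only a sketch under stated assumptions: a four-letter alphabet is used in place of the full alphabet to keep the run small, the seed and text length are arbitrary, and the function name `random_text_words` is hypothetical.

```python
import random
from collections import Counter


def random_text_words(n_chars, alphabet="abcd", seed=0):
    """Draw n_chars symbols uniformly from alphabet plus a space,
    then split the resulting text into "words" on the spaces."""
    rng = random.Random(seed)
    symbols = alphabet + " "
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    return text.split()


counts = Counter(random_text_words(200_000))
ranked = counts.most_common()

# Group word counts by word length: shorter "words" should be more frequent,
# and words of the same length should have similar counts.
by_length = {}
for word, count in counts.items():
    by_length.setdefault(len(word), []).append(count)
mean_by_length = {n: sum(v) / len(v) for n, v in sorted(by_length.items())}

for length in (1, 2, 3):
    print(length, round(mean_by_length[length], 1))
print(ranked[:4])
```

Plotting the counts in `ranked` against rank on log-log axes would show the stepped, roughly linear trend the paragraph describes, with one step per word length.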