Preface
It has been six months or more since I started writing what follows. This will be another history post. I aim to go from De Moivre's work to entropy and then to the loss functions typically used in Machine Learning, with several deviations along the way. Of course, it will not all fit in one post; this one starts with the normal distribution. Why De Moivre in particular? The sources from his time are easy to read and interpret, and they are readily available. While the notation is not the same as today's, it is not hard to decipher.
In particular, today, the original day of this post, is March 14th, yet the main source I use, De Moivre's book on chance, does not contain the symbol for pi. Instead, like most books of the time, it uses “natural units”: the circumference of a circle of radius unity. The symbol for it, however, varied from c to other Latin letters, depending on whether it was convenient to divide by some integer.
Potential paths to reach here are Boltzmann's mini autobiography or a random walk from Mendel’s post; I will connect future posts along the way.
Intro
One of the most interesting aspects of entropy, information theory, statistics, and volumes in n-dimensional spaces is due to the Gaussian distributions, also known as the normal distributions. How do these come about? What is special about them, and why are they found in deep learning, physics, and biology? Their presence is often tied to approximation and the convenient computational properties of natural exponentials and logarithms. These properties are due to the conformal nature of 1/x.
The normal or Gaussian distribution first comes from our understanding of the analytical properties of n!, which De Moivre was the first to study. De Moivre first used it in his book “The Doctrine of Chances,” published in 1718 [1]. However, he didn't introduce it through exponentials of e; nope, he used the inverse of a special logarithm known as the hyperbolic logarithm. This function was known to come from the integration of 1/x.
Today, we call them natural logarithms, but back then, the relationship between logarithms and the base of exponents was not yet understood, especially for exponents of Euler’s number e, whose name reflects that it was only understood through Euler's work several decades later.
The very Doctrine that finds Chance where it really is, being able to prove by a gradual increase of Probability, till it arrives at Demonstration, that where Uniformity, Order and Constancy reside, there also reside Choice and Design. — De Moivre
In the case of large numbers, the logarithm of the factorial has a convenient relationship to ordinary logarithms, with no factorials involved. Most students are introduced to this fact as Stirling’s approximation while taking Calculus, and then see it quickly noted by a Physics teacher. However, De Moivre understood this relationship, up to a constant, shortly before or at the same time as Stirling. Since he was mostly computing ratios of frequencies of events, there was no need to find the constant, though I believe he gave a numerical approximation for it. By the second edition, Stirling had sent a letter to De Moivre containing his derivation, and De Moivre included it as an appendix, along with a large table of logarithms of n!.
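To get a feel for how good this relationship is, here is a minimal Python sketch comparing ln n! with the modern textbook form of Stirling's approximation, n ln n - n + (1/2) ln n + (1/2) ln(2 pi). The function names are mine, and the formula is the modern statement rather than De Moivre's own notation. The exact value is computed by summing logarithms, much like reading off a table of logs.

```python
import math

def log_factorial(n: int) -> float:
    """Exact ln(n!) as a sum of logarithms: ln 2 + ln 3 + ... + ln n."""
    return sum(math.log(k) for k in range(2, n + 1))

def stirling(n: int, with_constant: bool = True) -> float:
    """Modern Stirling approximation to ln(n!).

    with_constant=False drops the (1/2) ln(2 pi) term, which mimics
    knowing the relationship only "up to a constant".
    """
    approx = n * math.log(n) - n + 0.5 * math.log(n)
    if with_constant:
        approx += 0.5 * math.log(2 * math.pi)
    return approx

for n in (10, 100, 1000):
    exact = log_factorial(n)
    print(n, exact, exact - stirling(n), exact - stirling(n, with_constant=False))
```

Without the constant, the error settles at roughly (1/2) ln(2 pi) for every large n. That fixed offset cancels whenever you take ratios, which fits with De Moivre not needing it.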
Before continuing, let’s place ourselves in the late 1600s.
The Late 80’s
“He [De Moivre] knows all these things better than I do.” — Newton
After over a century of religious wars due to the spread of Protestantism throughout Europe, the 17th century created this perfect storm. In France, decades of tolerance of Huguenots came to a halt during the reign of Louis XIV with the revocation of the Edict of Nantes, leading to the migration, deportation, or death of thousands of Huguenots, including De Moivre and his family.
In England, however, the 17th century marked the very short and temporary return of Catholic kings. James II almost revived Catholicism, attempting to purge most of the Puritan laws enacted during Oliver Cromwell’s leadership. Though he failed, this seems to have led to gambling becoming less taboo; as we can see, the introduction of exchanges allowed investors to use forward contracts legally, along with other securities. Until then, most courts in Europe would have had a hard time deciding whether such contracts were legal and binding, given their relationship to gambling.
Mathematics and physics were experiencing their own revolutions. Over a century, mathematicians slowly approached continuous results for a realm previously dominated by geometry, punctuated by Newton’s Principia in 1687. This led to previous problems suddenly being tractable, not only in mathematics but also in physics and economics, among numerous other fields.
De Moivre was in England by 1687, and soon, through letters of introduction (this is how people were vouched for when identity was difficult to ascertain), he ended up in the presence of a copy of the Principia. This seems to have fascinated him; according to the numerous anecdotes about him, he would rip pages out of the Principia to read them on his way to tutoring across the city.
Coffee Houses
The first coffee houses in London were inaugurated in the 1650s; by the time De Moivre arrived in England, several hundred were present within London alone. Before coffeehouses, gatherings primarily occurred in alehouses, where discussions and business dealings occurred amidst noise and rowdiness. Coffeehouses emerged as quieter, more composed alternatives, fostering serious conversations among men.
The topics discussed and the people who frequented each coffee house generally followed its location; e.g., those close to Oxford were frequented by academics, while those by the Stock Exchange were frequented by businessmen, etc. Coffee houses didn’t seem to restrict who could come, except by gender, though I’m sure there was plenty of racism and classism sprinkled in there. Some were often called “penny universities” because the admission cost was a cup of coffee, a penny, and because of the knowledge one could gain by engaging with the clientele.
"The coffeehouse was a place for like-minded scholars to congregate, to read, as well as learn from and to debate with each other, but was emphatically not a university institution, and the discourse there was of a far different order than any university tutorial” —Cowan
De Moivre supplemented his income with private tutoring, playing chess, and advising gamblers and speculators in London and its many coffee houses. He seemed to want an academic job, but his chances were slim, with no published works under his name and being French in England. First, he added `De` to his name to seem more upper-class. Second, he learned enough about Newton’s analytical methods and soon got a paper published on the subject, sponsored by Edmond Halley, shortly after meeting him in 1692.
As the century turned, he worked with Halley and published at least two works on planetary orbits, along with work on multinomial series and combinations. Shortly after came the publication of de Montmort’s 1708 analysis of games of chance (“Essay d’analyse sur les jeux de hazard”). A friend proposed some extensions of the problems found there to De Moivre; we can imagine it happened in one of these coffee houses. He noticed his work on combinations was different enough for him to publish it.
One final note: there is still much discussion about whether or not he made money gambling. He consulted for speculative endeavors in England and throughout Europe, and he didn’t seem greedy with his money, often gifting books and copies of his articles printed with his own money.
A game of chance.
By the 1690s, risk was all the rage, from games of chance to speculation. Securities and stocks already existed, and stock trading around London's Royal Exchange, one of the oldest securities markets, took off in the 1690s.
Games of chance, such as throwing dice, were often the first attempts to define something like probability. Cardano in the 1560s, and later even Galileo, took stabs at formulating these problems as combinatorics, especially Cardano, given his understanding of binomial coefficients up to the square and the cube.
De Moivre tried to distance himself from gambling and games of chance by positioning his work as a way to discourage them, simply providing tools to understand the risks involved. Nonetheless, his book's first seven “Use Cases” focus on the probability of throwing aces with dice!
Bernoulli was the first to prove a version of the Weak Law of Large Numbers: if a distribution has a true mean, the mean of a sample gets close to it, with high probability, once the number of samples is large. His work, accomplished around the same time as Newton’s Principia, was only published posthumously in 1713, shortly before De Moivre's book and after many of De Moivre's technical articles were published. Bernoulli's work, though, is the foundation of much of the frequentist definition of probability.
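As a quick illustration of the law in action, here is a small simulation sketch: throw a fair die n times and watch the sample mean settle near the true mean of 3.5. The die, the seed, and the function name are my own choices for illustration, and a simulation only illustrates the law; it proves nothing.

```python
import random

TRUE_MEAN = 3.5  # mean of a fair six-sided die: (1 + 2 + ... + 6) / 6

def sample_mean_of_die(n_throws: int, rng: random.Random) -> float:
    """Average of n_throws rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n_throws)) / n_throws

rng = random.Random(1713)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    mean = sample_mean_of_die(n, rng)
    print(f"n={n:>6}  sample mean={mean:.4f}  error={abs(mean - TRUE_MEAN):.4f}")
```

The error typically shrinks like 1/sqrt(n), which is the same square-root scaling that shows up later in De Moivre's work.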
De Moivre gives a great description of the previous large-N work:
Not so much an approximation as determining very wide limits; sounds almost like throwing shade.
This relationship between combinatorics and coefficients of binomials (x+y)^n forms a thread through most of the work in probability during this time, from algebra to the work of Huygens, Bernoulli, Newton, and De Moivre. This allowed them to obtain analytical formulations of the problems by often looking at coefficients of (1+1)^n, where the expansion into coefficients is done before replacing x and y with 1.
For example, if you wanted the number of possible ways to choose 2 objects from a set of 6 objects, you would look at the binomial (a+b)^6 and associate a with the event of choosing an object and b with the event of not choosing it:
(a+b)^6 = a^6 + 6 a^5 b + 15 a^4 b^2 + 20 a^3 b^3 + 15 a^2 b^4 + 6 a b^5 + b^6
The number of ways to choose 2 is then the coefficient of a^2 b^4, namely 15. This makes sense: there are 6 ways to choose the first object and 5 ways to choose the second, giving 30 ordered picks. Since the order in which you choose the objects doesn’t matter (first object 5, then object 1, is the same as first object 1, then object 5), you divide by 2 for overcounting and get 15. Therefore, figuring out combinatorics and computing binomial coefficients are similar computational tasks.
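The same correspondence can be checked mechanically. The sketch below (function name mine) builds the coefficients of (a+b)^n by repeatedly multiplying by (a+b), then compares the coefficient of a^2 b^4 with the counting argument and with Python's built-in `math.comb`.

```python
import math

def binomial_coefficients(n: int) -> list[int]:
    """Coefficients of (a + b)^n, built by multiplying by (a + b) n times.

    coeffs[k] is the coefficient of a^k b^(n - k).
    """
    coeffs = [1]  # (a + b)^0
    for _ in range(n):
        # Multiplying by (a + b): each new coefficient is the sum of two neighbours.
        coeffs = [x + y for x, y in zip([0] + coeffs, coeffs + [0])]
    return coeffs

coeffs = binomial_coefficients(6)
print(coeffs)             # all seven coefficients of (a + b)^6
print(coeffs[2])          # coefficient of a^2 b^4
print(6 * 5 // 2)         # the counting argument: 6 * 5 ordered picks, halved
print(math.comb(6, 2))    # the built-in answer
print(sum(coeffs))        # setting a = b = 1 gives (1 + 1)^6
```

Setting a = b = 1 at the end is the (1+1)^n trick: the coefficients sum to 2^6 = 64, the total number of ways to choose or not choose each of the 6 objects.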
In the preface, he even identifies the main tools he used from Principia.
“The general Theorem invented by Sir Isaac Newton, for raising a Binomial to any Power given, facilitates infinitely the Method of Combinations, representing in one View the Combination of all the Chances, that can happen in any given Number of Times. […] Another Method I have made use of, is that of Infinite Series, which in many cases will solve the Problems of Chance more naturally than Combinations.” —De Moivre [2]
The first one was what we today would call the Taylor expansion of the binomial about a point, using fluxions, i.e., derivatives. See, Newton did something curious. He assumed the binomial (a+b)^n can be expanded for any real value of n. For example, what if n was negative or fractional? Certainly, the formula above is, at least at the time, undefined. We know that the series should converge to a real value when the ratio a/b is between -1 and 1. So Newton went forth and did what we, these days, would call an analytic continuation, though back then it would be “assume it works, get a reasonable result, and justify after the fact,” and found a formula that would be useful for negative n. With n → -n:
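In modern notation, Newton's generalized expansion is usually written as (1+x)^p = 1 + p x + p(p-1)/2! x^2 + p(p-1)(p-2)/3! x^3 + ..., valid for |x| < 1, where x plays the role of the ratio of the two terms of the binomial. Here is a minimal numerical sketch of that modern form (function names mine, not Newton's or De Moivre's notation), checking it for a positive integer, a negative integer, and a fractional power.

```python
import math

def generalized_binomial(p: float, k: int) -> float:
    """Newton's coefficient p (p-1) ... (p-k+1) / k!, defined for any real p."""
    coeff = 1.0
    for i in range(k):
        coeff *= (p - i) / (i + 1)
    return coeff

def binomial_series(x: float, p: float, terms: int = 60) -> float:
    """Partial sum of (1 + x)^p = sum_k C(p, k) x^k, assuming |x| < 1."""
    return sum(generalized_binomial(p, k) * x**k for k in range(terms))

x = 0.3
for p in (6, -2, 0.5):
    # Compare the truncated series with the directly computed power.
    print(p, binomial_series(x, p), (1 + x) ** p)
```

For a positive integer power the coefficients vanish after k = p, so the series terminates and you recover the ordinary binomial theorem; for negative or fractional powers it goes on forever, which is where the infinite series come in.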
Why do we need negative powers when the binomials we are interested in for combinatorics all have positive powers? Say you want to compute the central binomial coefficient, i.e., the probability that an object will be found in the most probable state. One interesting problem is the probability of finding an object back at the starting position of a random walk. In the case above, this coefficient can be written as:
As n gets large, you will end up with large polynomial powers. De Moivre then shows that it is given by
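In modern notation, the large-n estimate for the middle term is usually written as C(n, n/2) / 2^n ≈ 2 / sqrt(2 pi n). The sketch below compares the exact value with that estimate; the function names and the sample values of n are my own, and the formula is the modern statement, not a transcription of De Moivre's.

```python
import math

def central_probability(n: int) -> float:
    """Exact probability of exactly n/2 heads in n fair coin tosses (n even).

    Equivalently: the chance a simple random walk is back at its start after n steps.
    """
    return math.comb(n, n // 2) / 2**n

def large_n_estimate(n: int) -> float:
    """Large-n estimate 2 / sqrt(2 pi n) for the same probability."""
    return 2 / math.sqrt(2 * math.pi * n)

for n in (10, 100, 3600):
    exact = central_probability(n)
    estimate = large_n_estimate(n)
    print(n, exact, estimate, exact / estimate)
```

The ratio in the last column creeps toward 1 as n grows, and the only trace of the circle is the 2 pi under the square root.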
He used both the binomial expansion and the infinite series interpretation for n!
After crediting Stirling for the constants, he shows that this series can be related to pi and gives an alternate form of the series in terms of Bernoulli’s original numbers. Note that these are all hyperbolic logarithms, i.e., the logarithm from the integral of a hyperbola, i.e., 1/x.
The original proof by De Moivre was only published in a supplement to his Miscellanea Analytica and in earlier papers for the Royal Society. After the Newton-Leibniz fiasco, folks were very careful to make sure their results were tracked, so De Moivre would often give copies of his results to colleagues for “safe” keeping; in case of any “you are stealing my results” attacks, he would have evidence. You can tell by the often defensive language, “It is now X years since I found what follows; …” throughout his text.
He shows that the logarithm of the ratio between the central coefficient and a coefficient a distance L away from the center follows the typical large-N approximation:
This, of course, is the crux, the normal distribution:
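In modern notation, the statement is that ln[C(n, n/2 + l) / C(n, n/2)] ≈ -2 l^2 / n, so the coefficients fall off from the center like exp(-2 l^2 / n), which is the bell curve. A minimal check of that modern form, with my own function name and sample values:

```python
import math

def log_ratio(n: int, l: int) -> float:
    """Exact ln of C(n, n/2 + l) / C(n, n/2), for even n."""
    mid = n // 2
    return math.log(math.comb(n, mid + l) / math.comb(n, mid))

n = 3600
for l in (10, 30, 60):
    exact = log_ratio(n, l)
    gaussian = -2 * l**2 / n  # exponent of the bell curve
    print(l, exact, gaussian)
```

The agreement is best when l is small compared with n, which is the regime of the large-N approximation.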
But there is more: not only does he give you the normal distribution, he gives you the standard deviation for Poisson-like counting. Since we are dealing with counting experiments, he didn’t see the generalization to real-valued standard deviations. Still, here he gives you not only the std ~ \sqrt(n) but also the probability of being within that bound, 68%!
Note that none of this involved Euler’s number or the gamma function definition, which is how Stirling’s approximation is usually presented in most courses.
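The 68% can be checked by brute force. For n fair coin tosses the standard deviation is sqrt(n)/2, so the sketch below sums the exact binomial probabilities within one standard deviation of n/2 and compares the result with the normal-curve value erf(1/sqrt(2)). The function name and the choice n = 3600 are mine, picked so that sqrt(n)/2 is a whole number.

```python
import math

def prob_within(n: int, half_width: int) -> float:
    """Exact P(|heads - n/2| <= half_width) for n fair coin tosses (n even)."""
    mid = n // 2
    favourable = sum(
        math.comb(n, k) for k in range(mid - half_width, mid + half_width + 1)
    )
    return favourable / 2**n

n = 3600                      # illustrative choice: sqrt(n) / 2 = 30 exactly
sigma = math.isqrt(n) // 2    # one standard deviation, in counts
print(prob_within(n, sigma))        # exact binomial sum
print(math.erf(1 / math.sqrt(2)))   # area of the normal curve within one sigma
```

The exact sum comes out slightly above the normal-curve value because the discrete count includes both endpoints; the two agree to within about a percentage point.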
Overall, I loved his writing and pretended he was being sarcastic throughout, though I have no idea whether that is the case. I highly recommend the preface if you are into mathematical or physics history.
We can trace Newton’s and De Moivre's work directly to Boltzmann and Gibbs’s statistical interpretation of thermodynamics. We will see where we end up next; I might jump directly to higher dimensions and discuss metrics in those spaces since they might be of interest to Machine Learning enthusiasts.
Doctrine of Chances, 2nd Edition https://catalog.lindahall.org/discovery/delivery/01LINDAHALL_INST:LHL/1287366660005961
Doctrine of Chances, 3rd Edition https://openlibrary.org/books/OL6239276M/The_doctrine_of_chances
https://arxiv.org/pdf/0708.3965.pdf