Preface

These are my notes on Bayesian Statistics.

They are based on online courses I took after working as a Data Scientist for a number of years. During that time, I often felt that if I were better at probability and statistics I could overcome certain limitations in the data using more sophisticated mathematics, build even better models, and, perhaps most importantly, make better inferences using models I had already developed.

I recognize that this is not the consensus view in the industry, but I tend to attribute that to the fact that a very high proportion of the Data Scientists I met never took a formal course in statistics or probability. Coming from computer science, engineering, full-stack development, physics, or other advanced degrees, they learn some of the recent ML techniques and reference certain Data Science and Machine Learning Bibles such as (VanderPlas 2016) or (Bishop 2006). Such manuals are of great use to a working Data Scientist, much as (Fisher 1925) and its later editions were of great use to researchers who were more focused on their research than on pushing the limits of what statistics could do for them.

However, I did get to interact with some very talented Data Scientists, many of whom had solid backgrounds in statistics and probability. They challenged my thinking and led me to believe that, aside from learning many ML models and techniques, there is room for making better use of statistics and probability theory in Data Science. This was only reinforced as I took more advanced courses on Reinforcement Learning, Continuous Control, Natural Language Processing, Recommendation Systems, Neural Networks, agent-based modeling, and Microeconomics.

Almost all of these areas drew on a few mathematical themes captured by statistics and probability theory. One insight was that all these areas tended to make use of a rather small set of mathematical results.

Another was that the most direct way to understand what we want to get out of modeling, and to put it all together into a simple end-to-end formulation, is through Bayesian hierarchical models, with their rich EDA, diagnostics, and visualization tools, supplemented by graphical models, causal inference, and counterfactual reasoning.

A further insight was that most statistics courses and books use toy problems and examples, while the real world often hits us with problems of much greater complexity. It then becomes our headache to find the simplest models that capture the essence of the problem without succumbing to the complexity of reality. Physicists tend to have the best training in this regard. Most of their lessons start with an abstract problem and then jump to a first-order, and occasionally second-order, Taylor series approximation. They are also trained to think about uncertainty quantification as a result ± some error.

I tried reviewing the main results in statistics and probability theory on my own, and once I started to get better results I decided to take some classes. I often felt that the material was introductory and simplistic. Real-world problems are monsters, and the examples are usually trivial in comparison. The issues raised in class are rarely the troubles I saw at work. But I do believe deep down that the two are somehow related. And indeed, as I covered more material, certain aspects began to connect.

I took the time to keep track of many of these monsters and realized they were more like gremlins: small, often a nuisance, but possible to keep in check if you follow a few simple precautions. More significantly, keeping these issues in mind helped me notice that many of them have well-established solutions that I had forgotten or never learned.

I noticed that, as in real classes, the teachers made mistakes, their motivation was not always clear, and they skipped steps or referred to certain results without elaboration. On the other hand, each teacher offered different insights into this complicated area of data analysis.

I therefore have a number of main areas of focus:

1. What are the questions one should ask?
2. What are the explicit details of each example or problem?
3. What is the mathematical representation of the model for these?
4. What is the code representation for this in R and in Python?
5. Can I find or create diagrams to make the interpretation of probabilities clearer?
6. Can I annotate the main equations to break them down?
7. Can I keep track of the most useful mathematical results in probability and statistics?
    1. I tried to keep these handy as appendices to keep the main material less cluttered. However, as time goes by these keep expanding.
8. Can I learn to communicate my intuitions about these concepts to laymen and to colleagues?
    1. Some courses have discussion prompts; I tried to make the most of these and included them in these notes.
    2. I realized that while I often get excited about a new paper or technique, my colleagues are smart and talented people who quickly ask questions that are difficult to answer. These questions often reveal gaps in knowledge, which can dampen the enthusiasm for implementing new ideas.
    3. I found that good communicators can overcome these issues more readily and connect new ideas to what is already known.

In the past I often found exercises rather easy to solve, and so I breezed through them with little attention. This time I resolved to treat them as opportunities to get better at using and manipulating probabilities and posterior distributions. So although I could score around 75% on each exercise based on intuition alone, I made the extra effort to understand what is really going on mathematically.

At work, one of the main challenges is making the problem conform to a simple model. This can be even more challenging when the target is a latent (unobserved) variable, or when you are considering a synergy of multiple effects with both seen and unseen confounds. In many cases it is unclear how to proceed based on the simple examples we see in these classes. However, I am now able to look at such problems from a more critical point of view. I also see great advantages in a quick exposure to many new simple models. In a Bayesian hierarchical framework, each can become a link for adding just a little more complexity, integrating new types of prior information, and so on. So I view these models as springboards for overcoming my next challenges.

References

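(Casella and Berger 2002): e-book (http://home.ustc.edu.cn/~zt001062/MaStmaterials/George%20Casella&Roger%20L.Berger--Statistical%20Inference.pdf), solutions (https://contacts.ucalgary.ca/info/math/files/info/unitis/courses/STAT723/W2010/LEC1/STAT723-W10-LEC1-Publisher%27s-Solution-to-All-Problems-in-Text.pdf)
(Spanos 2019)
(Hobbs and Hooten 2015): e-book (https://esl.hohoweiya.xyz/references/A_First_Course_in_Bayesian_Statistical_Methods.pdf), website (https://pdhoff.github.io/book/)
(VanderPlas 2016): e-book (https://jakevdp.github.io/PythonDataScienceHandbook/), notebooks (https://github.com/jakevdp/PythonDataScienceHandbook)
(Bishop 2006): e-book (https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf), website (https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/)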
