Probability distributions are mathematical functions that describe the probabilities of possible outcomes of a random process. Many processes taking place in the world around us can be described by a handful of distributions that have been well researched and analyzed. Getting one’s head around these few goes a long way towards being able to statistically model a wide range of phenomena. Let’s take a look at six useful probability distributions!
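To make that concrete, here is a minimal sketch of what "modeling a phenomenon with a distribution" looks like in practice, using scipy.stats with an arrival-rate scenario invented purely for illustration:

```python
from scipy import stats

# Hypothetical scenario: model the number of customer arrivals per hour
# as Poisson-distributed with an average rate of 4 per hour.
arrivals = stats.poisson(mu=4)

print(arrivals.pmf(2))       # probability of seeing exactly 2 arrivals
print(arrivals.cdf(6))       # probability of seeing at most 6 arrivals
print(arrivals.rvs(size=5))  # simulate five hours of arrivals
```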
The Monty Hall problem is a decades-old brain teaser that’s still confusing people today. It is loosely based on an old American TV game show and is named after its host, Monty Hall. At the final stage of the game, the contestants would face a choice in which, by switching their initial pick, they could double their chances of winning a brand-new car. But guess what: most of them did not switch! Would you be wiser? Read on to find out!
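And if you don’t trust the math, you can always simulate. Below is a minimal Monte Carlo sketch of the game (the function name and trial count are mine, for illustration); it consistently shows switching winning about twice as often as staying:

```python
import random

def play_monty_hall(switch, trials=100_000):
    """Simulate the game and return the fraction of wins."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)   # door hiding the car
        pick = random.randrange(3)  # contestant's initial pick
        # The host opens a door that hides a goat and wasn't picked.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the only remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play_monty_hall(switch=False))  # ~0.33
print(play_monty_hall(switch=True))   # ~0.67
```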
The Central Limit Theorem, or CLT, is taught in every STATS101 class. A typical way of introducing it is to present the formulae, discuss the assumptions, and go through a couple of calculations involving the normal density function. What’s missing is the CLT’s relevance to data scientists’ day-to-day work. Let me try to point it out.
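As a teaser, here is a minimal simulation (sample sizes and the choice of an exponential distribution are arbitrary, for illustration) of what the CLT promises: means of samples drawn from a heavily skewed distribution still end up approximately normal.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 10,000 samples of size n = 50 from a skewed (exponential)
# distribution and compute the mean of each sample.
n = 50
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# CLT: the sample means should be approximately normal, centered on the
# true mean (1.0), with standard deviation close to 1 / sqrt(50) ≈ 0.14.
print(sample_means.mean(), sample_means.std())
```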
A short primer on why we can reject hypotheses, but cannot accept them, with examples and visuals.
Hypothesis testing is the basis of classical statistical inference. It’s a framework for making decisions under uncertainty, designed to prevent you from making stupid decisions, provided there is data to verify their stupidity. If there is no such data… ¯\_(ツ)_/¯
The catch here is that you can only use hypothesis testing to dismiss a choice as a stupid…
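To see the asymmetry in action, here is a minimal sketch with scipy (the data and the 0.05 threshold are illustrative assumptions): depending on the p-value, we either reject the null hypothesis or fail to reject it; we never accept it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: the same metric measured in two groups of an experiment.
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.3, scale=1.0, size=200)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

if p_value < 0.05:
    print("Reject the null hypothesis of equal means.")
else:
    # Note the phrasing: we fail to reject; we do not 'accept' the null.
    print("Not enough evidence to reject the null hypothesis.")
```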
Recently, I couldn’t help but notice something alarming about popular machine learning books. Even the best titles, which do a great job explaining the algorithms and their applications, tend to neglect one important aspect. In cases where statistical rigor is needed to do things properly, they often suggest dangerously over-simplified solutions, causing a severe headache to a statistician-by-training such as myself and detrimentally impacting the machine learning workflow.
Even the best machine learning books tend to neglect topics in which statistical rigor is needed to do things properly, proposing dangerously over-simplified solutions instead.
A couple of weeks back, I…
I’ve been reading a popular book on machine learning recently. Once I reached the chapter on feature engineering, the author noted that, since most machine learning algorithms require numeric input, categorical variables need to be encoded as numeric ones. For instance, to paraphrase the example, we could encode a categorical variable education_level, which takes the values elementary, high_school, and university, as the numbers 1, 2, and 3, respectively. At that point, even though I’m an ML engineer by trade, the statistician-by-training within me cried out loud! Do people just run .fit_predict()
…
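To illustrate what made the statistician cry, here is a sketch with made-up data following the book’s example: mapping the categories to 1, 2, and 3 silently tells the model that university is “three times” elementary and that the levels are equally spaced, whereas one-hot encoding imposes no such scale.

```python
import pandas as pd

df = pd.DataFrame({
    "education_level": ["elementary", "high_school", "university", "high_school"]
})

# The encoding suggested in the book: imposes an arbitrary numeric scale.
mapping = {"elementary": 1, "high_school": 2, "university": 3}
df["encoded"] = df["education_level"].map(mapping)

# A safer default for categorical variables: one-hot encoding,
# which makes no assumption about order or spacing.
one_hot = pd.get_dummies(df["education_level"], prefix="edu")
print(pd.concat([df, one_hot], axis=1))
```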
The learning rate is arguably the most important hyperparameter to tune in a neural network. Unfortunately, it is also one of the hardest to tune properly. But don’t despair, for the Learning Rate Finder will get you to pretty decent values quickly! Let’s see how it works and how to implement it in TensorFlow.
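As a preview, here is a minimal sketch of the idea as a Keras callback (the class and parameter names are mine, not the exact implementation from the article): train for a handful of batches while growing the learning rate exponentially, record the loss at each step, and then pick a rate from the region where the loss falls fastest.

```python
import numpy as np
import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    """Sweep the learning rate from start_lr to end_lr, one step per batch."""

    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=100):
        super().__init__()
        self.lrs = np.geomspace(start_lr, end_lr, num_steps)
        self.losses = []
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        # Set the learning rate for the upcoming batch.
        tf.keras.backend.set_value(
            self.model.optimizer.learning_rate, self.lrs[self.step])

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])
        self.step += 1
        if self.step >= len(self.lrs):
            self.model.stop_training = True

# Usage: finder = LRFinder(); model.fit(x, y, epochs=1, callbacks=[finder]),
# then plot finder.losses against finder.lrs on a log scale.
```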
There ain’t no such thing as a free lunch, at least according to the popular adage. Well, not anymore! Not when it comes to neural networks, that is to say. Read on to see how to improve your network’s performance with an incredibly simple yet clever trick called the Monte Carlo Dropout.
The magic trick we are about to introduce only works if your neural network contains dropout layers, so let’s kick off by briefly introducing these. Dropout boils down to simply switching off some neurons at each training step. At each step, a different set of neurons is switched off…
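The trick itself, in a minimal sketch (the function name and sample count are mine, for illustration): keep dropout active at prediction time by calling the model with training=True, run several stochastic forward passes, and average them. The spread across passes doubles as an uncertainty estimate.

```python
import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x, num_samples=50):
    """Average num_samples stochastic forward passes with dropout on."""
    # training=True keeps the dropout layers active at inference time,
    # so every pass runs a slightly different sub-network.
    preds = np.stack([
        model(x, training=True).numpy() for _ in range(num_samples)
    ])
    return preds.mean(axis=0), preds.std(axis=0)

# Usage: mean_pred, uncertainty = mc_dropout_predict(model, x_test)
```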
I have been working on a project recently in which a couple of Docker images are built along the way, and they end up being sent to different third-party servers. For privacy reasons, some specific files must not be sent to particular servers. Hence, each image has its own blacklist of files it should not contain. This should be handled by the .dockerignore files, except that my .dockerignores got, well, ignored (no pun intended).
It took me hours to find the solution, which, obviously, turned out to be a one-liner.
I hope I can save you some miserable…
You may have heard about the so-called kernel trick: a maneuver that allows support vector machines, or SVMs, to work well with non-linear data. The idea is to map the data into a high-dimensional space in which it becomes linearly separable, and then apply a simple, linear SVM. Sounds sophisticated, and to some extent it is. However, while it might be hard to understand how the kernels work, it is pretty easy to grasp what they are trying to achieve. Read on to see for yourself!
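As a taste of what is to come, here is a small numeric sketch (with toy vectors of my choosing): for the degree-2 polynomial kernel, the kernel value (a · b)² equals the dot product of the explicitly mapped vectors, so the SVM gets the benefit of the high-dimensional space without ever computing the map.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2D input."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Both print 121.0: the kernel computes the same inner product as the
# explicit mapping, but without building the high-dimensional vectors.
print(np.dot(a, b) ** 2)
print(np.dot(phi(a), phi(b)))
```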