The Central Limit Theorem, or CLT, is taught in every STATS 101 class. A typical way of introducing the topic is to present the formulae, discuss the assumptions, and go through a couple of calculations involving the normal density function. What's often missing is the CLT's relevance to data scientists' day-to-day work. Let me try to point it out.
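To make that relevance concrete, here is a minimal simulation sketch of my own (not taken from any textbook), assuming NumPy is available: means of samples drawn from a heavily skewed exponential distribution end up looking approximately normal, just as the CLT promises.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many samples from a heavily skewed distribution (exponential),
# then compute the mean of each sample.
sample_size = 50
n_samples = 10_000
sample_means = rng.exponential(scale=2.0, size=(n_samples, sample_size)).mean(axis=1)

# The CLT says the sample means are approximately normal with
# mean about 2.0 and standard deviation about 2.0 / sqrt(sample_size).
print(sample_means.mean())  # ~2.0
print(sample_means.std())   # ~2.0 / np.sqrt(50), i.e. ~0.28
```

A histogram of sample_means would look bell-shaped even though the underlying exponential distribution is anything but.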
A short primer on why you can reject hypotheses but cannot accept them, with examples and visuals.
Hypothesis testing is the basis of classical statistical inference. It's a framework for making decisions under uncertainty, with the goal of preventing you from making stupid decisions — provided there is data to verify their stupidity. If there is no such data… ¯\_(ツ)_/¯
The goal of hypothesis testing is to prevent you from making stupid decisions — provided there is data to verify their stupidity.
The catch here is that you can only use hypothesis testing to dismiss a choice as stupid…
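To illustrate the asymmetry, here is a small sketch of my own (assuming NumPy and SciPy) of a two-sample t-test: a large p-value only means we failed to reject the null hypothesis of equal means, not that we accepted it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two small samples from populations whose means actually differ a little.
a = rng.normal(loc=0.0, scale=1.0, size=20)
b = rng.normal(loc=0.3, scale=1.0, size=20)

t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)

# If p_value > 0.05, we fail to reject the null hypothesis of equal means.
# That is NOT evidence that the means are equal: here they differ by 0.3,
# but the samples are simply too small for the test to detect it.
```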
Recently, I couldn't help but notice something alarming about popular machine learning books. Even the best titles, which do a great job explaining the algorithms and their applications, tend to neglect one important aspect. In cases where statistical rigor is needed to do things properly, they often suggest dangerously over-simplified solutions, causing a severe headache to a statistician-by-training such as myself and detrimentally impacting the machine learning workflow.
Even the best machine learning books tend to neglect topics in which statistical rigor is needed to do things properly, proposing dangerously over-simplified solutions instead.
A couple of weeks back, I have…
I’ve been reading a popular book on machine learning recently. When I reached the chapter on feature engineering, the author noted that, since most machine learning algorithms require numeric input, categorical variables need to be encoded as numeric ones. For instance, to paraphrase the example, we could encode a categorical variable education_level, which takes the values elementary, high_school, and university, as the numbers 1, 2, and 3, respectively. At that point, even though I’m an ML Engineer by trade, I heard the statistician-by-training within me cry out loud! Do people just run .fit_predict()
…
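For illustration, here is a small sketch of my own (not the book's code), assuming pandas, contrasting that naive integer encoding of education_level with one-hot encoding:

```python
import pandas as pd

df = pd.DataFrame(
    {"education_level": ["elementary", "high_school", "university", "high_school"]}
)

# Naive integer encoding: imposes a numeric scale that assumes the "distance"
# between elementary and high_school equals that between high_school and
# university -- an assumption most models will happily exploit.
mapping = {"elementary": 1, "high_school": 2, "university": 3}
df["education_level_ordinal"] = df["education_level"].map(mapping)

# One-hot encoding: one binary column per category, no implied spacing.
df_one_hot = pd.get_dummies(df["education_level"], prefix="education_level")
print(df_one_hot)
```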
The learning rate is arguably the most important hyperparameter to tune in a neural network. Unfortunately, it is also one of the hardest to tune properly. But don’t despair, for the Learning Rate Finder will get you to pretty decent values quickly! Let’s see how it works and how to implement it in TensorFlow.
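As a preview, here is one possible sketch of such a finder written as a Keras callback. It is my own illustration rather than the article's exact implementation; the start and end learning rates and the number of steps are arbitrary choices, and it assumes the optimizer uses a plain learning-rate variable rather than a schedule.

```python
import numpy as np
import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    """Exponentially increase the learning rate each batch and record the loss."""

    def __init__(self, start_lr=1e-7, end_lr=1.0, n_steps=100):
        super().__init__()
        self.lrs = np.geomspace(start_lr, end_lr, n_steps)
        self.losses = []
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        if self.step < len(self.lrs):
            # Overwrite the optimizer's learning rate for this batch.
            tf.keras.backend.set_value(
                self.model.optimizer.learning_rate, self.lrs[self.step]
            )

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])
        self.step += 1
        if self.step >= len(self.lrs):
            self.model.stop_training = True
```

After running model.fit for a single epoch with this callback attached, you would plot losses against lrs and pick a learning rate slightly below the point where the loss starts to explode.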
There ain’t no such thing as a free lunch, at least according to the popular adage. Well, not anymore! Not when it comes to neural networks, that is to say. Read on to see how to improve your network’s performance with an incredibly simple yet clever trick called the Monte Carlo Dropout.
The magic trick we are about to introduce only works if your neural network has dropout layers, so let’s kick off by briefly introducing them. Dropout boils down to simply switching off some neurons at each training step. At each step, a different set of neurons is switched off…
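With dropout layers in place, Monte Carlo Dropout amounts to keeping them switched on at prediction time and averaging several stochastic forward passes. Here is a minimal sketch of the idea, assuming a Keras model that contains Dropout layers:

```python
import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x, n_samples=50):
    """Run the model n_samples times with dropout active and aggregate the outputs."""
    # Passing training=True keeps the dropout layers switched on at inference time,
    # so each forward pass uses a different random subset of neurons.
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    # Mean prediction plus a spread that can serve as a rough uncertainty estimate.
    return preds.mean(axis=0), preds.std(axis=0)
```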
I have recently been working on a project in which a couple of Docker containers are built along the way and end up being sent to different third-party servers. For privacy reasons, some specific files must not be sent to particular servers. Hence, each container has its own blacklist of files it should not accept inside. This should be handled by the .dockerignore files, except that my .dockerignores got, well, ignored (no pun intended).
It took me hours to find the solution, which, obviously, turned out to be a one-liner.
I hope I can save you some miserable…
You may have heard about the so-called kernel trick, a maneuver that allows support vector machines, or SVMs, to work well with non-linear data. The idea is to map the data into a high-dimensional space in which it becomes linearly separable, and then apply a simple, linear SVM there. It sounds sophisticated, and to some extent it is. However, while it might be hard to understand how the kernels work, it is pretty easy to grasp what they are trying to achieve. Read on to see it for yourself!
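To see that goal in action, here is a small sketch of my own, assuming scikit-learn: on concentric circles, a linear SVM is hopeless, while an SVM with an RBF kernel, which implicitly maps the data into a higher-dimensional space, separates the classes easily.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear:", linear_svm.score(X_test, y_test))  # roughly chance level
print("rbf:   ", rbf_svm.score(X_test, y_test))     # close to perfect
```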
I don’t use SQL very often and every time I need it, I find myself googling for the syntax of even the most basic operations. To help myself out, I have put together a cheat sheet of useful queries, all in one place. I hope you will find it useful, too.
The queries are in Postgres, but the patterns are translatable to other SQL flavors. The notes are based on the great DataCamp courses, such as Introduction to SQL, Joining Data in SQL, and Introduction to Relational Databases in SQL, as well as on my own StackOverflow searches. Enjoy!
Most machine learning models for classification output numbers between 0 and 1 that we tend to interpret as probabilities of the sample belonging to respective classes. In scikit-learn, for instance, we can obtain them by calling a predict_proba()
method on the model. Proba, as in ‘probabilities’, right? These numbers typically sum to one across all classes, confirming our belief that they are probabilities. But are they? Well, usually not, and here is why.
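If you want to check this yourself, here is a minimal sketch of my own, assuming scikit-learn: compare the predict_proba() output of a raw random forest with a calibrated version of the same model using a calibration (reliability) curve.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Raw scores from the forest vs. the same model wrapped in a calibrator.
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic"
).fit(X_train, y_train)

for name, model in [("raw", forest), ("calibrated", calibrated)]:
    proba = model.predict_proba(X_test)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    # For well-calibrated probabilities, frac_pos should track mean_pred closely.
    print(name, list(zip(mean_pred.round(2), frac_pos.round(2))))
```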