🚀 Machine Learning Engineer @ Cosmose AI | 👨‍🏫 Data Science Instructor @ DataCamp | 🌐 https://michaloleszak.github.io/

On the relevance of the cornerstone of statistical inference for data scientists.

Central Limit Theorem, or CLT, is taught in every STATS101 class. A typical way of introducing this topic is by presenting the formulae, discussing the assumptions, and going through a couple of calculations involving the normal density function. What’s missing is CLT’s relevance for data scientists’ day-to-day work. Let me try to point it out.
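
To see what the theorem actually buys you before any formulae, here is a minimal simulation sketch (the exponential distribution and the sample size of 50 are my arbitrary picks): means of repeated samples from a heavily skewed distribution still pile up into a bell curve.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 samples of size 50 from a heavily skewed exponential
# distribution; record each sample's mean.
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)

# The CLT says these means are approximately normal, centered on the
# population mean (2.0), with standard deviation sigma / sqrt(n).
print(sample_means.mean())  # ~2.0
print(sample_means.std())   # ~2.0 / np.sqrt(50), i.e. about 0.28
```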


A short primer on why we can reject hypotheses but cannot accept them, with examples and visuals.

Hypothesis testing is the basis of classical statistical inference. It’s a framework for making decisions under uncertainty, with the goal of preventing you from making stupid decisions — provided there is data to verify their stupidity. If there is no such data… ¯\_(ツ)_/¯

The goal of hypothesis testing is to prevent you from making stupid decisions — provided there is data to verify their stupidity.

The catch here is that you can only use hypothesis testing to dismiss a choice as a stupid…
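
To make the asymmetry concrete, here is a small sketch with made-up numbers (a one-sample t-test via scipy; the 0.05 threshold is the usual convention, not a law of nature):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=100.5, scale=10, size=30)  # hypothetical measurements

# H0: the population mean equals 100.
result = stats.ttest_1samp(data, popmean=100)

if result.pvalue < 0.05:
    print("Reject H0: a mean of 100 looks like a stupid assumption.")
else:
    # Note the phrasing: we fail to reject H0; we never accept it.
    print("Fail to reject H0: the data are consistent with a mean of 100.")
```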


A statistician’s perspective on how (not to) do it to keep your machine learning workflow unflawed.

Recently, I couldn’t help but notice something alarming about popular machine learning books. Even the best titles, which do a great job explaining the algorithms and their applications, tend to neglect one important aspect. In cases where statistical rigor is needed to do things properly, they often suggest dangerously over-simplified solutions, causing a severe headache to a statistician-by-training such as myself and detrimentally impacting the machine learning workflow.

Even the best machine learning books tend to neglect topics in which statistical rigor is needed to do things properly, proposing dangerously over-simplified solutions instead.

A couple of weeks back, I have…


A statistician’s perspective on the types of variables, their meaning, and implications for machine learning.

I’ve been reading a popular book on machine learning recently. Once I reached the chapter on feature engineering, the author noted that, since most machine learning algorithms require numeric data as input, categorical variables need to be encoded as numeric ones. For instance, to paraphrase the example, we could encode a categorical variable education_level which takes the values: elementary, high_school, university, as numbers 1, 2, and 3, respectively. At that point, even though I’m an ML Engineer by trade, I heard the inner statistician-by-training within me cry out loud! Do people just run .fit_predict()


Get to the neighborhood of optimal values quickly without costly searches.

The learning rate is arguably the most important hyperparameter to tune in a neural network. Unfortunately, it is also one of the hardest to tune properly. But don’t despair, for the Learning Rate Finder will get you to pretty decent values quickly! Let’s see how it works and how to implement it in TensorFlow.
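
As a rough sketch of the idea (my own bare-bones version, not necessarily the implementation discussed in the article): exponentially ramp the learning rate up over a short mock training run, record the loss after each batch, and pick a value a bit below the point where the loss starts to blow up.

```python
import numpy as np
import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    """Ramps the learning rate up exponentially, one step per batch."""

    def __init__(self, min_lr=1e-6, max_lr=1.0, num_steps=100):
        super().__init__()
        self.lrs = np.geomspace(min_lr, max_lr, num_steps)
        self.losses = []
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        # Overwrite the optimizer's learning rate before each batch.
        tf.keras.backend.set_value(
            self.model.optimizer.learning_rate, self.lrs[self.step]
        )

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])
        self.step += 1
        if self.step >= len(self.lrs):
            self.model.stop_training = True  # one sweep is enough

# Usage sketch:
#   finder = LRFinder()
#   model.fit(x, y, epochs=100, callbacks=[finder])
# Then plot finder.lrs against finder.losses and pick a learning rate
# roughly an order of magnitude below where the loss diverges.
```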


Improve your neural network for free with one small trick, getting model uncertainty estimate as a bonus.

There ain’t no such thing as a free lunch, at least according to the popular adage. Well, not anymore! Not when it comes to neural networks, that is to say. Read on to see how to improve your network’s performance with an incredibly simple yet clever trick called the Monte Carlo Dropout.

Dropout

The magic trick we are about to introduce only works if your neural network has dropout layers, so let’s kick off by briefly introducing these. Dropout boils down to simply switching off some neurons at each training step. At each step, a different set of neurons is switched off…
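
Jumping ahead a little, here is a minimal sketch of the Monte Carlo part (assuming a Keras model that already contains Dropout layers): call the model with training=True so dropout stays active at prediction time, then average several stochastic forward passes; their spread doubles as an uncertainty estimate.

```python
import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x, n_passes=100):
    """Average n stochastic forward passes with dropout left switched on."""
    # training=True keeps the Dropout layers active at inference time.
    preds = np.stack(
        [model(x, training=True).numpy() for _ in range(n_passes)]
    )
    return preds.mean(axis=0), preds.std(axis=0)  # prediction, uncertainty
```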


How I made my Dockerfiles stop ignoring .dockerignores

I have recently been working on a project in which a couple of Docker containers are built along the way and end up being sent to different third-party servers. For privacy reasons, some specific files must not be sent to particular servers. Hence, each container has its own blacklist of files it should not accept inside. This should be handled by the .dockerignore files, except that my .dockerignores got, well, ignored (no pun intended).

It took me hours to find the solution, which, obviously, turned out to be a one-liner.

I hope I can save you some miserable…


An intuitive visual explanation

You may have heard about the so-called kernel trick, a maneuver that allows support vector machines, or SVMs, to work well with non-linear data. The idea is to map the data into a high-dimensional space in which it becomes linearly separable and then apply a simple, linear SVM there. Sounds sophisticated, and to some extent it is. However, while it might be hard to understand how the kernels work, it is pretty easy to grasp what they are trying to achieve. Read on to see it for yourself!
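
As a quick appetizer (the dataset and parameters below are my own picks, not the article’s): on two concentric circles, a linear SVM is hopeless, while an RBF-kernel SVM, which implicitly works in a higher-dimensional space, separates them almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear:", linear_svm.score(X_test, y_test))  # around chance level
print("rbf:   ", rbf_svm.score(X_test, y_test))     # close to 1.0
```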


The basics cheat sheet for ignoramuses such as myself.

I don’t use SQL very often and every time I need it, I find myself googling for the syntax of even the most basic operations. To help myself out, I have put together a cheat sheet of useful queries, all in one place. I hope you will find it useful, too.

The queries are in Postgres, but the patterns are translatable to other SQL flavors. The notes are based on the great DataCamp courses, such as Introduction to SQL, Joining Data in SQL, and Introduction to Relational Databases in SQL, as well as on my own StackOverflow searches. Enjoy!

Key to symbols:

  • 🟠…


Are you sure your model returns probabilities? 🎲

Most machine learning models for classification output numbers between 0 and 1 that we tend to interpret as probabilities of the sample belonging to the respective classes. In scikit-learn, for instance, we can obtain them by calling the predict_proba() method on the model. Proba, as in ‘probabilities’, right? These numbers typically sum to one across all classes, confirming our belief that they are probabilities. But are they? Well, usually not, and here is why.
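
One way to check this yourself, sketched below with an arbitrary dataset and model: compare the predicted ‘probabilities’ against the observed frequencies using scikit-learn’s calibration_curve. For a well-calibrated model the two match; for most models out of the box, they don’t.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# For true probabilities, the observed fraction of positives in each bin
# would match the mean predicted value in that bin.
frac_positive, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```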
