Recent Posts

More Posts

One of the most frequent format for data import and export in python is CSV. Reading and loading a CSV file to pandas is straightforward – assuming you know the separator, or the separator is a comma. While the name Comma Separated Values implies CSV file automatically use comma as separator (also called delimiter) this is not always the case. Depending on your settings the separator can be anything from semicolons to pipe character.

CONTINUE READING

Machine learning is not the ideal tool for time series forecasting for a number of reasons, but, as I will demonstrate it in a future post, limited models can be built for short-term forecasting exercises. One aspect of time series data is, however, that you can’t split your observations randomly into train and test subsets: you train on an early interval and test on a later one. Standard ML libraries, such as scikit-learn, don’t provide a tool for that.

CONTINUE READING

In a previous post I showed how to use data science tools to find hidden features in unstructured text and analyzed how the complexity of the lyrics of Beatles songs changed over time. In this post I do a little follow-up and compare complete works of The Beatles with that of two others using the same methodology and metrics. Comparing Beatles with other musicians may help put the original numbers into the perspective.

CONTINUE READING

Stargazer is a neat tool to present model estimates. It accepts a fairly large number of object-types and creates nice-looking, ready-to-publish outputs of their main parameters. In many cases, however, the default settings do not give us the proper numerical results, and customizing the output is not that straightforward. This is part one in a two-part series on how to customize stargazer. When I first encountered stargazer I already had a problem with the model outputs the package created: in cross-sectional data the observations are often of different sizes, which leads to heteroskedastic model residuals where simple standard errors are useless for measuring variable significance.

CONTINUE READING

Bootstrap sampling is a widely used method in machine learning and in statistics. The main idea is that we try to decrease overfitting and the chance of myopic tree-building if run our algorithm multiple times using the same data, but always taking a different sample with repetitions from our original data. (For instance, random forest builds the trees using repeated bootstrap samples.) On a machine learning class one of my class mates asked what percentage of the original data shows up in the bootstrapped sample.

CONTINUE READING