Tiny Little Things in Data Science

Peter Duronelly

May 14, 2019 3 min read

Detect Csv Delimiters

One of the most frequent formats for data import and export in Python is CSV. Reading and loading a CSV file into pandas is straightforward, assuming you know the separator or the separator is a comma. While the name Comma Separated Values implies that CSV files automatically use a comma as the separator (also called the delimiter), this is not always the case. Depending on your settings, the separator can be anything from a semicolon to a pipe character.
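As a minimal sketch of the idea, the standard library's `csv.Sniffer` can guess the separator from a text sample. The helper name `detect_delimiter`, the candidate list, and the sample data below are assumptions made for illustration, not taken from the post itself.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;|\t") -> str:
    """Guess the delimiter of a CSV text sample using csv.Sniffer.

    `candidates` restricts the guess to a few plausible separators
    (an assumption; extend it if your files use something else).
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical sample data, made up for this example.
raw = "name;age;city\nAnna;34;Budapest\nBela;29;Szeged\n"

sep = detect_delimiter(raw)

# Parse the sample with the detected separator.
rows = list(csv.reader(io.StringIO(raw), delimiter=sep))
```

The detected value can then be passed on to `pandas.read_csv` through its `sep` argument. Sniffing is a heuristic, so it is worth checking the result on files with few rows or free-text columns.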

Peter Duronelly

Nov 20, 2018 5 min read

Train Test Splitting Time Series Data

Machine learning is not the ideal tool for time series forecasting for a number of reasons, but, as I will demonstrate in a future post, limited models can be built for short-term forecasting exercises. One aspect of time series data, however, is that you can’t split your observations randomly into train and test subsets: you train on an early interval and test on a later one. The default random split in standard ML libraries, such as scikit-learn’s train_test_split, shuffles the observations, so it does not do this for you out of the box.
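A chronological split can be sketched in a few lines of plain Python. The function name `chronological_split`, the `test_fraction` parameter, and the toy series are hypothetical and only illustrate the "train on early, test on late" rule; the sketch assumes the observations are already sorted by time.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    observations: Sequence[T], test_fraction: float = 0.2
) -> Tuple[List[T], List[T]]:
    """Split time-ordered observations into an early train part and a late test part.

    Assumes `observations` is already sorted from oldest to newest.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    n_test = round(len(observations) * test_fraction)
    split_at = len(observations) - n_test
    # No shuffling: everything before split_at is train, the rest is test.
    return list(observations[:split_at]), list(observations[split_at:])


# Toy daily series (made-up values), ordered by ISO date string.
series = [(f"2018-01-{day:02d}", float(day)) for day in range(1, 11)]

train, test = chronological_split(series, test_fraction=0.2)
```

Because nothing is shuffled, every training observation precedes every test observation, which is the property a random split would break.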

Peter Duronelly

Oct 9, 2018 5 min read

Comparing Beatles and Bob Dylan

In a previous post I showed how to use data science tools to find hidden features in unstructured text and analyzed how the complexity of the lyrics of Beatles songs changed over time. In this post I do a little follow-up and compare the complete works of The Beatles with those of two other artists using the same methodology and metrics. Comparing The Beatles with other musicians may help put the original numbers into perspective.

Peter Duronelly

Oct 2, 2018 6 min read

Robust and Clustered Standard Errors in Stargazer

Stargazer is a neat tool for presenting model estimates. It accepts a fairly large number of object types and creates nice-looking, ready-to-publish outputs of their main parameters. In many cases, however, the default settings do not give us the proper numerical results, and customizing the output is not that straightforward. This is part one in a two-part series on how to customize stargazer. When I first encountered stargazer I already had a problem with the model outputs the package created: in cross-sectional data the observations are often of different sizes, which leads to heteroskedastic model residuals, where simple standard errors are useless for measuring variable significance.

Peter Duronelly

Sep 23, 2018 2 min read

Bootstrap Samples

Bootstrap sampling is a widely used method in machine learning and in statistics. The main idea is that we try to decrease overfitting and the chance of myopic tree-building if we run our algorithm multiple times using the same data, but always taking a different sample with replacement from our original data. (For instance, random forest builds the trees using repeated bootstrap samples.) In a machine learning class, one of my classmates asked what percentage of the original data shows up in the bootstrapped sample.
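One way to explore that question is the short sketch below, which assumes a bootstrap sample of the same size n as the original data. Each observation is missed by a single draw with probability 1 - 1/n, so the expected share of distinct observations in the sample is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 63.2%) as n grows. The function names are hypothetical, and the simulation is only a sanity check on the formula.

```python
import random


def expected_unique_fraction(n: int) -> float:
    """Expected share of distinct original rows in a size-n bootstrap sample."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, rng: random.Random) -> float:
    """Draw n row indices with replacement and measure the share of distinct rows."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


rng = random.Random(42)  # fixed seed so the run is repeatable
theory = expected_unique_fraction(10_000)
simulated = simulated_unique_fraction(10_000, rng)
```

For any reasonably large n the two numbers should sit close to each other, with roughly a third of the original rows left out of each bootstrap sample.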

Peter Duronelly

a former financial market professional transitioning to data/digital
