One of the most frequently used formats for data import and export in Python is CSV. Loading a CSV file into pandas is straightforward – assuming you know the separator, or the separator is a comma. While the name Comma Separated Values suggests that CSV files always use a comma as separator (also called a delimiter), this is not always the case. Depending on your settings, the separator can be anything from a semicolon to a pipe character.
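As a quick sketch of how this looks in practice with pandas' read_csv (the raw string below stands in for a real semicolon-separated file):

```python
import io
import pandas as pd

# a small semicolon-separated sample, standing in for a real file
raw = "name;score\nalice;10\nbob;12\n"

# explicitly set the separator when it is not a comma
df = pd.read_csv(io.StringIO(raw), sep=";")

# if the separator is unknown, pandas can try to sniff it;
# sep=None requires the slower Python parsing engine
df2 = pd.read_csv(io.StringIO(raw), sep=None, engine="python")
print(df.equals(df2))  # True: both parses give the same frame
```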
Machine learning is not the ideal tool for time series forecasting for a number of reasons, but, as I will demonstrate in a future post, limited models can be built for short-term forecasting exercises. One peculiarity of time series data, however, is that you can’t split your observations randomly into train and test subsets: you train on an early interval and test on a later one. The standard splitting helpers in ML libraries, such as scikit-learn’s train_test_split, shuffle the observations randomly by default, so a time-aware split needs a different approach.
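As a minimal sketch (not the tool from that future post), a one-off chronological split can be as simple as filtering on a cutoff date; the DataFrame and column names below are made up for illustration:

```python
import pandas as pd

# toy daily series, standing in for real observations
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=100, freq="D"),
    "y": range(100),
})

def time_split(data, date_col, cutoff):
    """Train on everything before the cutoff, test on everything after."""
    data = data.sort_values(date_col)
    train = data[data[date_col] < cutoff]
    test = data[data[date_col] >= cutoff]
    return train, test

train, test = time_split(df, "date", pd.Timestamp("2019-03-01"))
print(len(train), len(test))  # 59 and 41
```

For cross-validation rather than a single split, scikit-learn does ship a TimeSeriesSplit helper, but a simple cutoff like the above is often all a short-term forecasting exercise needs.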
In a previous post I showed how to use data science tools to find hidden features in unstructured text and analyzed how the complexity of the lyrics of Beatles songs changed over time. In this post I do a little follow-up and compare the complete works of The Beatles with those of two other artists, using the same methodology and metrics. Comparing The Beatles with other musicians may help put the original numbers into perspective.
Stargazer is a neat tool for presenting model estimates. It accepts a fairly large number of object types and creates nice-looking, ready-to-publish summaries of their main parameters. In many cases, however, the default settings do not give us the proper numerical results, and customizing the output is not that straightforward. This is part one of a two-part series on how to customize stargazer.
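To give a flavour of the default workflow, here is a minimal sketch assuming the Python port of stargazer (the pip-installable stargazer package) and statsmodels; the data is randomly generated purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from stargazer.stargazer import Stargazer  # Python port of the R package

# toy cross-sectional data, purely for illustration
rng = np.random.default_rng(42)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
y = 1.0 + 2.0 * X["x1"] - 0.5 * X["x2"] + rng.normal(size=200)

# two nested OLS models, the typical side-by-side comparison
model1 = sm.OLS(y, sm.add_constant(X[["x1"]])).fit()
model2 = sm.OLS(y, sm.add_constant(X)).fit()

# Stargazer renders the fitted models as one publication-ready table
table = Stargazer([model1, model2])
print(table.render_latex())  # or table.render_html()
```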
When I first encountered stargazer, I already had a problem with the model outputs the package created: in cross-sectional data the observations often differ in size, which leads to heteroskedastic model residuals, where simple standard errors are useless for judging variable significance.
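The usual remedy is heteroskedasticity-robust (White) standard errors. A minimal sketch with statsmodels, on made-up data whose noise grows with the regressor:

```python
import numpy as np
import statsmodels.api as sm

# toy data with heteroskedastic noise: residual spread grows with x
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y = 3.0 + 1.5 * x + rng.normal(scale=x)  # noise scale depends on x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                            # classical SEs
robust = ols.get_robustcov_results(cov_type="HC1")  # robust SEs

print(ols.bse)     # plain standard errors, misleading here
print(robust.bse)  # heteroskedasticity-consistent standard errors
```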
Bootstrap sampling is a widely used method in machine learning and in statistics. The main idea is that we try to decrease overfitting and the chance of myopic tree-building if we run our algorithm multiple times on the same data, but each time take a different sample drawn with replacement from the original data. (For instance, random forest builds its trees using repeated bootstrap samples.) In a machine learning class, one of my classmates asked what percentage of the original data shows up in the bootstrapped sample.
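The well-known answer is roughly 1 - 1/e, about 63.2%: the probability that a given row is never drawn in n tries is (1 - 1/n)^n, which tends to 1/e as n grows. A quick simulation confirms it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000     # size of the original data set
trials = 1_000

# draw n indices with replacement and count how many distinct rows appear
fractions = [
    np.unique(rng.integers(0, n, size=n)).size / n
    for _ in range(trials)
]

print(np.mean(fractions))    # ~0.632
print(1 - (1 - 1 / n) ** n)  # theoretical value, tends to 1 - 1/e
```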
Clustering is one of the best-known unsupervised learning tools. In the standard case you have an observation matrix where observations are in rows and the variables that describe them are in columns. But data can also be structured differently, just like the distance matrix on a map. In that case observations appear in both rows and columns, and each element of the matrix is a measure of distance, or dissimilarity, between two observations.
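As a minimal sketch, SciPy's hierarchical clustering can work directly from such a matrix; the toy distances below are invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# toy symmetric distance matrix for four observations (e.g. map distances)
D = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.5],
    [5.0, 4.0, 1.5, 0.0],
])

# linkage expects the condensed (upper-triangular) form of the matrix
condensed = squareform(D)
Z = linkage(condensed, method="average")

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)  # e.g. [1 1 2 2]
```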
The Beatles rose to fame through their sometimes simple but always powerful music, but they have never been famous for their poetry. The group’s lyrics, however, did change over the band’s short existence, and we can use text analysis to track these changes. This post is about measuring the change in the complexity of the group’s lyrics, from the Please Please Me to the Abbey Road album, showing how we can use basic data science tools to find really fancy patterns in unstructured text data.
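As a hypothetical illustration of one such complexity metric (not necessarily the one used in the post), the type-token ratio measures the share of distinct words in a text:

```python
def type_token_ratio(lyrics: str) -> float:
    """Share of distinct words in a text: a crude proxy for lexical richness."""
    words = lyrics.lower().split()
    return len(set(words)) / len(words) if words else 0.0

# made-up snippets, just to show the metric at work
print(type_token_ratio("love love me do you know I love you"))
print(type_token_ratio("here comes the sun and I say it's all right"))
```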
One of the things we need to manage in data analysis is resources. When we have large amounts of (‘big’) data, this can become a serious issue. One case where we should consider whether we really need all the data we have is when our database contains a lot of zeros, and these zeros happen to be irrelevant for our calculations. Python’s SciPy library offers a way to store and handle sparse matrices, which contain a large number of irrelevant zero values.
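A minimal sketch with SciPy's csr_matrix, on a made-up, mostly-zero matrix:

```python
import numpy as np
from scipy import sparse

# a dense matrix that is almost entirely zeros
dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[42, 7] = 5.0

# CSR stores only the non-zero entries plus their coordinates
m = sparse.csr_matrix(dense)
print(m.nnz)          # 2 stored values instead of 1,000,000
print(m.data.nbytes)  # bytes held by the stored values: 16
print(dense.nbytes)   # 8,000,000 bytes for the dense version

# arithmetic works much as with dense arrays
print((m * 2).nnz)
```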
Who Am I? I was originally trained as an economist with a Finance major and a Statistics minor. I spent 21 years in financial markets as an analyst and asset manager. After a while, though, I got interested in disruptive technologies, mostly data and data science, which are closely linked to my previous academic studies. Statistics is especially close to my heart, as I used to teach it to sophomore students at the Budapest University of Economics (now Corvinus University).