a former financial market professional transitioning to data/digital
Biography
I originally trained as an economist with a Finance major and a Statistics minor. I worked as an analyst and mutual fund portfolio manager for 21 years, then decided to switch to data and data science, building on my academic background in Statistics. I also used to teach Statistics to sophomore students at Corvinus University of Budapest. I now work as a data scientist at Magyar Telekom, a Deutsche Telekom subsidiary.
Interests
Statistics
Data & Data Science
Business & BI
Finance and Financial Markets (still…)
Education
MSc in Business Analytics, 2018
Central European University
MSc in Economics, Finance and Statistics, 1996
Budapest University of Economics (currently Corvinus University of Budapest)
One of the most frequent formats for data import and export in Python is CSV. Reading a CSV file into pandas is straightforward, assuming you know the separator, or the separator is a comma. While the name Comma Separated Values implies that CSV files always use a comma as the separator (also called the delimiter), this is not always the case. Depending on your settings, the separator can be anything from a semicolon to a pipe character.
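As a minimal sketch (the file name data.csv is a placeholder), the sep argument of pandas.read_csv handles non-comma delimiters, and sep=None with the Python engine even lets pandas sniff the delimiter on its own:

```python
import pandas as pd

# Comma-separated: the default, no extra arguments needed
df = pd.read_csv("data.csv")

# Semicolon-separated, common with European locale exports
df = pd.read_csv("data.csv", sep=";")

# Pipe-separated
df = pd.read_csv("data.csv", sep="|")

# Separator unknown: sep=None with the Python engine makes
# pandas detect the delimiter automatically
df = pd.read_csv("data.csv", sep=None, engine="python")
```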
Machine learning is not the ideal tool for time series forecasting for a number of reasons, but, as I will demonstrate in a future post, limited models can be built for short-term forecasting exercises. One aspect of time series data, however, is that you can't split your observations randomly into train and test subsets: you train on an early interval and test on a later one. Standard helpers in ML libraries such as scikit-learn shuffle the data by default, so time series need special treatment.
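As a hedged sketch on made-up data, a chronological split can be obtained from train_test_split by turning shuffling off, and scikit-learn's TimeSeriesSplit provides the cross-validation analogue:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Hypothetical data: 100 consecutive daily observations
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

# shuffle=False keeps chronological order:
# first 80% for training, last 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Cross-validation variant: every fold trains on an early
# interval and tests on the interval that follows it
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print(train_idx[-1], test_idx[0], test_idx[-1])
```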
In a previous post I showed how to use data science tools to find hidden features in unstructured text and analyzed how the complexity of the lyrics of Beatles songs changed over time. In this post I do a little follow-up and compare the complete works of The Beatles with those of two other artists, using the same methodology and metrics. Comparing The Beatles with other musicians may help put the original numbers into perspective.
Stargazer is a neat tool for presenting model estimates. It accepts a fairly large number of object types and creates nice-looking, ready-to-publish outputs of their main parameters. In many cases, however, the default settings do not give us the proper numerical results, and customizing the output is not that straightforward. This is part one of a two-part series on how to customize stargazer.
When I first encountered stargazer, I immediately had a problem with the model outputs the package created: in cross-sectional data the observations often differ in size, which leads to heteroskedastic model residuals, where simple standard errors are useless for measuring variable significance.
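The post itself works with R's stargazer; purely to illustrate the underlying issue, here is a hedged Python sketch (simulated data, statsmodels standing in for the R tooling) showing how heteroskedasticity-robust standard errors differ from the plain ones:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated cross-section where the error variance grows with x,
# i.e. the residuals are heteroskedastic by construction
x = rng.uniform(1, 10, size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=x)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

print(ols.bse)                               # plain standard errors
print(ols.get_robustcov_results("HC1").bse)  # robust (HC1) standard errors
```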
Bootstrap sampling is a widely used method in machine learning and in statistics. The main idea is that we try to decrease overfitting and the chance of myopic tree-building by running our algorithm multiple times on the same data, but always taking a different sample with replacement from the original data. (For instance, random forest builds its trees on repeated bootstrap samples.) In a machine learning class one of my classmates asked what percentage of the original data shows up in the bootstrapped sample.
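A quick, hedged simulation (with a made-up sample of size 10,000) hints at the answer; as the sample size grows, the share of original observations drawn at least once converges to 1 - 1/e, roughly 63.2%:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10_000            # hypothetical sample size
data = np.arange(n)

# One bootstrap sample: n draws with replacement
bootstrap = rng.choice(data, size=n, replace=True)

# Share of the original observations that appear at least once;
# for large n this approaches 1 - 1/e, about 63.2%
unique_share = np.unique(bootstrap).size / n
print(f"{unique_share:.1%}")
```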