SciPy | Tiny Little Things in Data Science

Clustering on a Dissimilarity Matrix

Clustering is one of the best-known unsupervised learning tools. In the standard case you have an observation matrix with observations in rows and the variables that describe them in columns. But data can also be structured differently, like the table of distances between cities on a map. In that case the observations label both the rows and the columns, and each element of the matrix measures the distance, or dissimilarity, between a pair of observations.
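As a rough illustration of clustering such data, here is a minimal sketch that runs hierarchical clustering directly on a dissimilarity matrix with SciPy. The matrix `D` is made-up data, and the choice of average linkage and of two clusters is an assumption for the example, not a recommendation from the post.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric dissimilarity matrix for four observations:
# rows and columns both index the observations, the diagonal is zero.
D = np.array([
    [0.0, 1.0, 9.0, 8.0],
    [1.0, 0.0, 8.5, 9.5],
    [9.0, 8.5, 0.0, 1.5],
    [8.0, 9.5, 1.5, 0.0],
])

# linkage() expects the condensed form (upper triangle as a 1-D vector),
# so convert the square matrix first.
condensed = squareform(D)

# Average linkage is an assumed choice; other methods are available.
Z = linkage(condensed, method="average")

# Cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With this toy matrix the first two observations should end up in one cluster and the last two in the other, since the within-pair dissimilarities are far smaller than the between-pair ones.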

Sparse Matrices in Python

One of the things we need to manage in data analysis is resources. With large amounts of (‘big’) data this can become a serious issue. One case where it is worth asking whether we really need all the data we have is when the dataset contains a lot of zeros and those zeros are irrelevant to our calculations. Python’s SciPy library offers a way to store and handle sparse matrices, which contain a large number of such irrelevant zero values.
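To make the saving concrete, here is a small sketch that stores a mostly-zero matrix in SciPy's CSR format and compares memory use. The matrix size and the three non-zero entries are invented for illustration, and CSR is only one of several sparse formats that `scipy.sparse` provides.

```python
import numpy as np
from scipy import sparse

# Hypothetical 1000 x 1000 matrix with only three non-zero entries.
dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[10, 500] = 4.5
dense[999, 999] = 7.0

# Convert to Compressed Sparse Row format, which keeps only the
# non-zero values plus the index arrays needed to locate them.
csr = sparse.csr_matrix(dense)

# Compare the memory held by the dense array and by the CSR arrays.
dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(csr.nnz, dense_bytes, sparse_bytes)

# Sparse matrices still support the usual linear algebra operations.
v = np.ones(1000)
product = csr @ v
```

In practice you would usually build the sparse matrix directly from the non-zero entries rather than from a dense array, so that the full matrix never has to fit in memory.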