Depending on your point of view, locality-sensitive hashing (LSH) can be seen as an alternative to clustering (LSH is already meant to put similar things close together in a lower-dimensional vector space), a dimensionality reduction technique, or a way to quickly retrieve candidates that are hopefully close together…

So when you ask how to do hierarchical clustering on results from LSH, you could either apply hierarchical clustering directly to the lower-dimensional LSH representation of each item, or use LSH to quickly retrieve candidates. In the former case, your vector space will very likely have weird properties, so you'll need to choose the clustering variant and the distance function carefully. In the latter case, I’d suggest trying to get the original (pre-LSH) vector space representation and then clustering the candidates in that vector space, which is probably much easier to “interpret”.
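To make the second option a bit more concrete, here is a minimal sketch. It assumes random-hyperplane LSH (just one of several LSH families) and cosine distance in the original space; all function names are made up for illustration and do not come from any library. LSH is only used to find bucket-mates, and the distances you would later feed into the clustering are computed on the original vectors.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(p * x for p, x in zip(plane, vec)) >= 0.0) for plane in planes
    )


def build_buckets(vectors, planes):
    """Group vector indices by their LSH signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[lsh_signature(vec, planes)].append(idx)
    return buckets


def cosine_distance(u, v):
    """Cosine distance computed on the original (pre-LSH) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def candidates_in_original_space(query_idx, vectors, planes, buckets):
    """Use LSH only to pick candidates, then rank them by their true distance."""
    query = vectors[query_idx]
    sig = lsh_signature(query, planes)
    return sorted(
        (cosine_distance(query, vectors[j]), j)
        for j in buckets[sig]
        if j != query_idx
    )


# Toy data: the second vector points the same way as the first, the third the opposite way.
vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
planes = make_hyperplanes(n_bits=8, dim=3)
buckets = build_buckets(vectors, planes)
neighbours = candidates_in_original_space(0, vectors, planes, buckets)
```

The pairwise distances inside each bucket can then be handed to whatever hierarchical clustering implementation you use, so the resulting dendrogram is expressed in the original space rather than in hash space.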

Hmmm, not sure if I understand the question. I’d say that as soon as you have some vector-space representation of your data, you can use clustering techniques. The main difficulty, however, is finding a meaningful distance measure for your specific use case.

In your case, the combination of geo-locations, time series and temperatures will most likely not be Euclidean 😉 So you’ll want to define your own distance measure by asking yourself something like: “What’s the distance between two datapoints that are 1 km apart and have the same temperature but were recorded 4 months apart?” Vary all combinations of the dimensions and think about how you’d like your “distance” to change. Also be aware that many clustering variants require certain properties of your defined “distance”, such as symmetry, a zero distance from each point to itself, and the triangle inequality…
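As a rough sketch of what such a hand-rolled distance could look like: the example below assumes each datapoint is a tuple `(lat, lon, day_of_year, temperature)`, treats time as seasonal (so late December is close to early January), and uses three scale constants that are pure placeholders. None of these choices come from your data; they are exactly the knobs you would have to decide on.

```python
import math

# Hypothetical scales: how much of each quantity counts as "one unit" of distance.
KM_SCALE = 100.0    # 100 km apart          -> 1 unit
DAY_SCALE = 30.0    # 30 days apart         -> 1 unit
TEMP_SCALE = 5.0    # 5 degrees difference  -> 1 unit


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(min(1.0, math.sqrt(a)))


def seasonal_gap_days(day1, day2, year_length=365):
    """Gap in days on a circular calendar (an assumption: seasons wrap around)."""
    gap = abs(day1 - day2) % year_length
    return min(gap, year_length - gap)


def combined_distance(a, b):
    """Scale each component, then combine them like a Euclidean norm."""
    lat1, lon1, day1, temp1 = a
    lat2, lon2, day2, temp2 = b
    geo = haversine_km(lat1, lon1, lat2, lon2) / KM_SCALE
    time = seasonal_gap_days(day1, day2) / DAY_SCALE
    temp = abs(temp1 - temp2) / TEMP_SCALE
    return math.sqrt(geo ** 2 + time ** 2 + temp ** 2)


# Two points about 1 km apart, same temperature, recorded about 4 months apart.
p = (48.000, 9.0, 10, 12.0)
q = (48.009, 9.0, 130, 12.0)
d = combined_distance(p, q)
```

With these placeholder scales the time gap dominates the result, which may or may not be what you want. If you cluster with SciPy, a function like this can be passed as the callable `metric` argument of `scipy.spatial.distance.pdist`.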

Now, here comes the question: is this applicable to time series? My data matrix consists of daily temperature time series for different locations, one per column. I would like to build clusters so that I can group locations regionally based on their temperature time series. Do you think your example could be applied to my problem?

Thanks in advance and best regards

Thank you, Jörn.

So, if I already have the distance matrix between my data, I have to pass the matrix in triangular shape, otherwise it will be interpreted as the dataset?

In my case, I have a pseudo-distance, since the diagonal is not zero but constant and equal to 0.5. Should I keep the diagonal in the triangular matrix or pass only the upper part?

Regards.
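A minimal sketch for this situation, assuming the clustering is done with SciPy's `scipy.cluster.hierarchy.linkage` (the question does not name the function, so that is a guess): `linkage` expects a condensed 1-D distance vector, that is, only the upper triangle without the diagonal, and interprets a 2-D array as raw observations. The toy matrix `D` below is made up; the diagonal of 0.5 is simply dropped, because the condensed form has no slot for it.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up symmetric pseudo-distance matrix with a constant 0.5 on the diagonal.
D = np.array([
    [0.5, 0.6, 2.0, 2.2],
    [0.6, 0.5, 2.1, 2.3],
    [2.0, 2.1, 0.5, 0.7],
    [2.2, 2.3, 0.7, 0.5],
])

# Work on a copy and zero the diagonal so that squareform's validity check passes.
D0 = D.copy()
np.fill_diagonal(D0, 0.0)

# Condensed form: the n * (n - 1) / 2 entries above the diagonal, row by row.
condensed = squareform(D0)

# Hierarchical clustering directly on the precomputed distances.
Z = linkage(condensed, method="average")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Whether ignoring the 0.5 self-distance is acceptable depends on what the pseudo-distance means in your application; the sketch only shows the mechanics of passing a precomputed matrix.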
