SimHash explained

In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar. In 2021 Google announced its intent to also use the algorithm in their newly created FLoC (Federated Learning of Cohorts) system.^[1]

Evaluation and benchmarks

A large scale evaluation has been conducted by Google in 2006^[2] to compare the performance of Minhash and Simhash^[3] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling^[4] and using Minhash and LSH for Google News personalization.^[5]

External links

Simhash Princeton Paper
Simhash explained
Comparison of MinHash vs. Simhash

Notes and References

Web site: Cyphers. Bennett. 2021-03-03. Google's FLoC Is a Terrible Idea. 2021-04-13. Electronic Frontier Foundation.
.
.
.
.

SimHash explained

Evaluation and benchmarks

See also

External links

Notes and References