A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data.[1] Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.
DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems. Notable distributed networks that use DHTs include BitTorrent's distributed tracker, the Kad network, the Storm botnet, the Tox instant messenger, Freenet, the YaCy search engine, and the InterPlanetary File System.
DHT research was originally motivated, in part, by peer-to-peer (P2P) systems such as Freenet, Gnutella, BitTorrent and Napster, which took advantage of resources distributed across the Internet to provide a single useful application. In particular, they took advantage of increased bandwidth and hard disk capacity to provide a file-sharing service.[2]
These systems differed in how they located the data offered by their peers. Napster, the first large-scale P2P content delivery system, required a central index server: each node, upon joining, would send a list of locally held files to the server, which would perform searches and refer the queries to the nodes that held the results. This central component left the system vulnerable to attacks and lawsuits.
Gnutella and similar networks moved to a query flooding model: in essence, each search would result in a message being broadcast to every machine in the network. While avoiding a single point of failure, this method was significantly less efficient than Napster. Later versions of Gnutella clients moved to a dynamic querying model which vastly improved efficiency.[3]
Freenet is fully distributed, but employs a heuristic key-based routing in which each file is associated with a key, and files with similar keys tend to cluster on a similar set of nodes. Queries are likely to be routed through the network to such a cluster without needing to visit many peers. However, Freenet does not guarantee that data will be found.
Distributed hash tables use a more structured key-based routing in order to attain both the decentralization of Freenet and Gnutella, and the efficiency and guaranteed results of Napster. One drawback is that, like Freenet, DHTs only directly support exact-match search, rather than keyword search, although Freenet's routing algorithm can be generalized to any key type where a closeness operation can be defined.
In 2001, four systems - CAN,[4] Chord,[5] Pastry, and Tapestry - ignited DHTs as a popular research topic. A project called the Infrastructure for Resilient Internet Systems (Iris) was funded by a $12 million grant from the United States National Science Foundation in 2002.[6] Researchers included Sylvia Ratnasamy, Ion Stoica, Hari Balakrishnan and Scott Shenker.[7] Outside academia, DHT technology has been adopted as a component of BitTorrent and in PlanetLab projects such as the Coral Content Distribution Network.[8]
DHTs characteristically emphasize the following properties:

- Autonomy and decentralization: The nodes collectively form the system without any central coordination.
- Fault tolerance: The system should be reliable (in some sense) even with nodes continuously joining, leaving, and failing.
- Scalability: The system should function efficiently even with thousands or millions of nodes.
A key technique used to achieve these goals is that any one node needs to coordinate with only a few other nodes in the system (most commonly, O(log n) of the n participants; see below), so that only a limited amount of work needs to be done for each change in membership.
Some DHT designs seek to be secure against malicious participants[10] and to allow participants to remain anonymous, though this is less common than in many other peer-to-peer (especially file sharing) systems; see anonymous P2P.
The structure of a DHT can be decomposed into several main components.[11][12] The foundation is an abstract keyspace, such as the set of 160-bit strings. A keyspace partitioning scheme splits ownership of this keyspace among the participating nodes. An overlay network then connects the nodes, allowing them to find the owner of any given key in the keyspace.
Once these components are in place, a typical use of the DHT for storage and retrieval might proceed as follows. Suppose the keyspace is the set of 160-bit strings. To index a file with a given filename and data in the DHT, the SHA-1 hash of the filename is generated, producing a 160-bit key k, and a message put(k, data) is sent to any node participating in the DHT. The message is forwarded from node to node through the overlay network until it reaches the single node responsible for key k as specified by the keyspace partitioning. That node then stores the key and the data. Any other client can then retrieve the contents of the file by again hashing the filename to produce k and asking any DHT node to find the data associated with k with a message get(k). The message will again be routed through the overlay to the node responsible for k, which will reply with the stored data.
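As a rough illustration of this flow, the sketch below hashes a filename with SHA-1 to derive the 160-bit key; the dht object and its put/get methods are hypothetical stand-ins for whatever overlay routing a real DHT client provides.

```python
import hashlib

def store_file(dht, filename: str, data: bytes) -> bytes:
    """put(k, data): store data under the SHA-1 hash of the filename."""
    key = hashlib.sha1(filename.encode("utf-8")).digest()  # 160-bit key k
    dht.put(key, data)  # routed through the overlay to the node owning k
    return key

def fetch_file(dht, filename: str) -> bytes:
    """get(k): re-hash the filename to the same key and look it up."""
    key = hashlib.sha1(filename.encode("utf-8")).digest()
    return dht.get(key)  # routed to the node responsible for k
```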
The keyspace partitioning and overlay network components are described below with the goal of capturing the principal ideas common to most DHTs; many designs differ in the details.
Most DHTs use some variant of consistent hashing or rendezvous hashing to map keys to nodes. The two algorithms appear to have been devised independently and simultaneously to solve the distributed hash table problem.
Both consistent hashing and rendezvous hashing have the essential property that removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs, and leaves all other nodes unaffected. Contrast this with a traditional hash table in which addition or removal of one bucket causes nearly the entire keyspace to be remapped. Since any change in ownership typically corresponds to bandwidth-intensive movement of objects stored in the DHT from one node to another, minimizing such reorganization is required to efficiently support high rates of churn (node arrival and failure).
Consistent hashing employs a function \delta(k_1, k_2) that defines an abstract notion of the distance between the keys k_1 and k_2, which is unrelated to geographical distance or network latency. Each node is assigned a single key called its identifier (ID). A node with ID i_x owns all the keys k_m for which i_x is the closest ID, measured according to \delta(k_m, i_x).
For example, the Chord DHT uses consistent hashing, which treats nodes as points on a circle, and \delta(k_1, k_2) is the distance traveling clockwise around the circle from k_1 to k_2. Thus, the circular keyspace is split into contiguous segments whose endpoints are the node identifiers. If i_1 and i_2 are two adjacent IDs, with a shorter clockwise distance from i_1 to i_2, then the node with ID i_2 owns all the keys that fall between i_1 and i_2.
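A minimal sketch of this ownership rule, assuming a 160-bit circular keyspace and node IDs obtained by hashing a node address; the helper names are illustrative rather than taken from any particular Chord implementation.

```python
import hashlib

KEYSPACE = 2 ** 160  # 160-bit circular keyspace

def node_id(address: str) -> int:
    """Place a node on the circle by hashing its address."""
    return int.from_bytes(hashlib.sha1(address.encode("utf-8")).digest(), "big")

def clockwise_distance(k1: int, k2: int) -> int:
    """delta(k1, k2): distance traveling clockwise around the circle."""
    return (k2 - k1) % KEYSPACE

def owner(key: int, node_ids: list[int]) -> int:
    """A key is owned by the node at the smallest clockwise distance
    from the key, i.e. the key's successor on the circle."""
    return min(node_ids, key=lambda nid: clockwise_distance(key, nid))
```

With this rule, adding or removing a single node only reassigns the keys whose nearest clockwise successor changes, which is the churn-friendly property described above.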
In rendezvous hashing, also called highest random weight (HRW) hashing, all clients use the same hash function h (chosen ahead of time) to associate a key with one of the available servers. Each client has the same list of server identifiers S_1, S_2, ..., S_n. Given some key k_m, a client computes the hash weights h(S_1, k_m), h(S_2, k_m), ..., h(S_n, k_m) and assigns the key to the server with the highest weight. A server with ID S_x therefore owns all the keys k_m for which the hash weight h(S_x, k_m) is higher than the hash weight of any other node for that key.
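A comparable sketch for HRW assignment, assuming string server identifiers and SHA-1 as the shared hash function h (any hash function agreed upon by all clients would serve).

```python
import hashlib

def hash_weight(server_id: str, key: str) -> int:
    """h(S_x, k_m): hash the (server, key) pair to an integer weight."""
    digest = hashlib.sha1(f"{server_id}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

def hrw_owner(servers: list[str], key: str) -> str:
    """The key is owned by the server with the highest hash weight."""
    return max(servers, key=lambda s: hash_weight(s, key))

# Every client computing hrw_owner(["s1", "s2", "s3"], "some-key")
# independently arrives at the same owner.
```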
Locality-preserving hashing ensures that similar keys are assigned to similar objects. This can enable more efficient execution of range queries; however, in contrast to consistent hashing, there is no longer any assurance that the keys (and thus the load) are uniformly and randomly distributed over the key space and the participating peers. DHT protocols such as Self-Chord and Oscar[13] address such issues. Self-Chord decouples object keys from peer IDs and sorts keys along the ring with a statistical approach based on the swarm intelligence paradigm.[14] Sorting ensures that similar keys are stored by neighbour nodes and that discovery procedures, including range queries, can be performed in logarithmic time. Oscar constructs a navigable small-world network based on random walk sampling, also assuring logarithmic search time.
Each node maintains a set of links to other nodes (its neighbors or routing table). Together, these links form the overlay network. A node picks its neighbors according to a certain structure, called the network's topology.
All DHT topologies share some variant of the most essential property: for any key k, each node either has a node ID that owns k or has a link to a node whose node ID is closer to k, in terms of the keyspace distance defined above. It is then easy to route a message to the owner of any key k using the following greedy algorithm (which is not necessarily globally optimal): at each step, forward the message to the neighbor whose ID is closest to k. When there is no such neighbor, then we must have arrived at the closest node, which is the owner of k as defined above. This style of routing is sometimes called key-based routing.
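The greedy step can be sketched as follows, assuming each node's routing table (a non-empty list of neighbor IDs) is known and `distance` is the keyspace metric, for example the clockwise distance used by Chord; the function and parameter names are illustrative.

```python
def greedy_route(key: int, start: int, routing_tables: dict[int, list[int]], distance) -> int:
    """Key-based routing: at each hop, forward toward the neighbor whose
    ID is closest to the key; stop when no neighbor is strictly closer.
    Returns the ID of the node that owns the key."""
    current = start
    while True:
        best = min(routing_tables[current], key=lambda n: distance(key, n))
        if distance(key, best) >= distance(key, current):
            return current      # no neighbor is closer: owner reached
        current = best          # forward the message one hop
```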
Beyond basic routing correctness, two important constraints on the topology are to guarantee that the maximum number of hops in any route (route length) is low, so that requests complete quickly; and that the maximum number of neighbors of any node (maximum node degree) is low, so that maintenance overhead is not excessive. Of course, having shorter routes requires higher maximum degree. Some common choices for maximum degree and route length are as follows, where n is the number of nodes in the DHT, using Big O notation:
| Max. degree | Max. route length | Used in | Note |
|---|---|---|---|
| O(1) | O(n) | | Worst lookup lengths, with likely much slower lookup times |
| O(1) | O(log n) | Koorde (with constant degree) | More complex to implement, but acceptable lookup time can be found with a fixed number of connections |
| O(log n) | O(log n) | | Most common, but not optimal (degree/route length). Chord is the most basic version, with Kademlia seeming the most popular optimized variant (should have improved average lookup) |
| O(log n) | O(log n / log(log n)) | Koorde (with optimal lookup) | More complex to implement, but lookups might be faster (have a lower worst case bound) |
| O(\sqrt{n}) | O(1) | | Worst local storage needs, with much communication after any node connects or disconnects |
The most common choice, O(log n) degree and O(log n) route length, is not optimal in terms of the degree/route length tradeoff, but such topologies typically allow more flexibility in choice of neighbors. Many DHTs use that flexibility to pick neighbors that are close in terms of latency in the physical underlying network.
Maximum route length is closely related to diameter: the maximum number of hops in any shortest path between nodes. Clearly, the network's worst case route length is at least as large as its diameter, so DHTs are limited by the degree/diameter tradeoff that is fundamental in graph theory. Route length can be greater than diameter, since the greedy routing algorithm may not find shortest paths.[16]
Aside from routing, there exist many algorithms that exploit the structure of the overlay network for sending a message to all nodes, or a subset of nodes, in a DHT.[17] These algorithms are used by applications to do overlay multicast, range queries, or to collect statistics. Two systems that are based on this approach are Structella,[18] which implements flooding and random walks on a Pastry overlay, and DQ-DHT, which implements a dynamic querying search algorithm over a Chord network.[19]
Because of the decentralization, fault tolerance, and scalability of DHTs, they are inherently more resilient against a hostile attacker than a centralized system.
Open systems for distributed data storage that are robust against massive hostile attackers are feasible.[20]
A DHT system that is carefully designed to have Byzantine fault tolerance can defend against a security weakness, known as the Sybil attack, which affects most current DHT designs.[21] [22] Whanau is a DHT designed to be resistant to Sybil attacks.[23]
Petar Maymounkov, one of the original authors of Kademlia, has proposed a way to circumvent the weakness to the Sybil attack by incorporating social trust relationships into the system design.[24] The new system, codenamed Tonika or also known by its domain name as 5ttt, is based on an algorithm design known as "electric routing" and co-authored with the mathematician Jonathan Kelner.[25] Maymounkov has now undertaken a comprehensive implementation effort of this new system. However, research into effective defences against Sybil attacks is generally considered an open question, and a wide variety of potential defences are proposed every year in top security research conferences.
Most notable differences encountered in practical instances of DHT implementations include at least the following: