Hamming Distance Calculator
Calculates the Hamming distance between two equal-length binary strings (or any two equal-length symbol sequences) by counting the number of positions where the corresponding symbols differ.
Formula
d_H(x, y) = Σ 1[x_i ≠ y_i], summed for i from 1 to n.
d_H(x, y) is the Hamming distance between strings x and y. n is the length of each string (both must be equal length). x_i and y_i are the symbols at position i in strings x and y respectively. The indicator function 1[x_i ≠ y_i] equals 1 when the symbols differ and 0 when they are the same. The result is the total count of positions where x and y disagree.
Source: Hamming, R.W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147–160.
How it works
The Hamming distance between two strings of equal length is computed by scanning both strings symbol by symbol and incrementing a counter each time the characters at the same position are different. For binary strings this is equivalent to counting the number of bit flips required to transform one codeword into another, which directly corresponds to the number of single-bit transmission errors that have occurred. The metric satisfies all the axioms of a mathematical distance: it is non-negative, equals zero only when the strings are identical, is symmetric, and obeys the triangle inequality.
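The scan-and-count procedure described above is a one-liner in most languages. A minimal Python sketch (the function name is illustrative):

```python
def hamming_distance(x: str, y: str) -> int:
    """Count positions where two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must be equal length")
    # zip pairs up symbols at the same position; sum counts mismatches
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance("10110100", "11010110"))  # -> 3
```

The length check matters: without it, `zip` would silently truncate to the shorter string and return a misleading count.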
Formally, given two strings x and y of length n, the Hamming distance is d_H(x, y) = Σ 1[x_i ≠ y_i] for i from 1 to n. The normalized Hamming distance divides the raw count by the string length n, producing a value between 0 and 1 that is comparable across strings of different lengths. A normalized distance of 0 means the strings are identical, while a value of 1 means every symbol differs. The complementary similarity score, expressed as a percentage, equals (1 − normalized distance) × 100.
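The normalized distance and the similarity percentage follow directly from the raw count, as in this sketch:

```python
def normalized_hamming(x: str, y: str) -> float:
    """Raw Hamming distance divided by the common length n."""
    if len(x) != len(y) or not x:
        raise ValueError("strings must be equal, nonzero length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

d = normalized_hamming("10110100", "11010110")
print(d)              # -> 0.375
print((1 - d) * 100)  # -> 62.5 (similarity as a percentage)
```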
In coding theory, the minimum Hamming distance of a code (the smallest distance between any two distinct codewords) determines its error-detecting and error-correcting capabilities: a code with minimum distance d can detect up to d − 1 errors and correct up to ⌊(d − 1)/2⌋ errors. In machine learning, Hamming distance serves as a proximity measure for binary feature vectors in k-nearest-neighbor classifiers, locality-sensitive hashing, and the evaluation of multi-label classification models. In cryptography, a large Hamming distance between plaintext and ciphertext blocks is a key indicator of the diffusion property in block ciphers.
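The detect/correct capabilities stated above are simple arithmetic on the minimum distance; a small helper (hypothetical name) makes the relationship explicit:

```python
def code_capabilities(d_min: int) -> tuple[int, int]:
    """Errors a code with minimum distance d_min can (detect, correct)."""
    return d_min - 1, (d_min - 1) // 2

detect, correct = code_capabilities(3)
print(detect, correct)  # prints "2 1": d_min = 3 detects 2 errors, corrects 1
```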
Worked example
Suppose you want to find the Hamming distance between the two 8-bit binary strings 10110100 and 11010110.
Align the strings position by position:
Position 1: 1 vs 1 → match
Position 2: 0 vs 1 → differ (+1)
Position 3: 1 vs 0 → differ (+1)
Position 4: 1 vs 1 → match
Position 5: 0 vs 0 → match
Position 6: 1 vs 1 → match
Position 7: 0 vs 1 → differ (+1)
Position 8: 0 vs 0 → match
Total differing positions: 3. The Hamming distance is therefore d_H = 3.
The normalized Hamming distance is 3 / 8 = 0.375, and the similarity is (1 − 0.375) × 100 = 62.50%.
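The position-by-position tally above can be reproduced directly:

```python
x, y = "10110100", "11010110"
# 1-based positions where the two strings disagree
mismatches = [i + 1 for i, (a, b) in enumerate(zip(x, y)) if a != b]
print(mismatches)  # -> [2, 3, 7]

d = len(mismatches)
print(d, d / len(x), (1 - d / len(x)) * 100)  # -> 3 0.375 62.5
```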
In a coding-theory context, if these were the two closest codewords in a binary block code, a minimum distance of 3 would mean the code can detect up to 2 bit errors and correct any single-bit error, exactly the property of the classic Hamming(7,4) code.
Limitations & notes
The Hamming distance is only defined for strings of equal length; it cannot be applied directly to sequences that differ in size (use Levenshtein distance for variable-length strings). It treats all symbol mismatches as equally costly, which may not be appropriate for non-binary alphabets where some substitutions are more meaningful than others. For continuous-valued feature vectors, Euclidean or cosine distance is generally more informative. When comparing DNA sequences or protein chains, biologically motivated substitution matrices (e.g., BLOSUM) often provide better results than raw Hamming counts. Finally, Hamming distance does not capture positional context or runs of errors, so it may underestimate the impact of burst errors in some communication applications.
Frequently asked questions
What is the difference between Hamming distance and Levenshtein distance?
Hamming distance counts positions where two equal-length strings differ and does not allow insertions or deletions. Levenshtein (edit) distance counts the minimum number of single-character insertions, deletions, or substitutions needed to transform one string into another and works on strings of unequal length. Use Hamming for fixed-length codes and Levenshtein for general text comparison.
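The contrast is easy to see in code. A standard dynamic-programming sketch of Levenshtein distance handles the unequal lengths that Hamming distance cannot:

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum insertions, deletions, and substitutions to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein("flaw", "lawn"))  # -> 2 (delete 'f', append 'n')
```

Hamming distance is undefined for this pair, since the strings have different lengths only in general; here they happen to be equal length, yet the optimal edit script still uses an insertion and a deletion rather than aligned substitutions.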
Can Hamming distance be used for non-binary strings?
Yes. The definition applies to any alphabet — you simply count positions where the two strings have different symbols. For example, comparing DNA sequences over the alphabet {A, T, G, C} or comparing decimal digit strings are both valid applications. The formula is unchanged; only the symbol set differs.
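Because only the symbol set changes, the same counting code works unchanged over the DNA alphabet (sequences below are made up for illustration):

```python
def hamming(x: str, y: str) -> int:
    if len(x) != len(y):
        raise ValueError("equal length required")
    return sum(a != b for a, b in zip(x, y))

# Same formula, alphabet {A, T, G, C} instead of {0, 1}
print(hamming("GATTACA", "GACTATA"))  # -> 2 (positions 3 and 6 differ)
```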
How is Hamming distance used in error correction?
A binary code with minimum Hamming distance d_min can detect up to d_min − 1 bit errors and correct up to ⌊(d_min − 1) / 2⌋ bit errors. For example, a code with d_min = 3 (like the classic Hamming(7,4) code) can detect 2-bit errors and correct any single-bit error by mapping a received word to the nearest valid codeword.
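Nearest-codeword decoding can be sketched with the 3-bit repetition code {000, 111}, which also has d_min = 3 but is small enough to list in full (the Hamming(7,4) code works the same way with 16 codewords):

```python
def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))

codebook = ["000", "111"]  # repetition code, d_min = 3

def decode(received: str) -> str:
    """Map a received word to the nearest valid codeword."""
    return min(codebook, key=lambda c: hamming(c, received))

print(decode("010"))  # -> "000": the single flipped bit is corrected
```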
What does a normalized Hamming distance of 0.5 mean?
A normalized Hamming distance of 0.5 means exactly half of the positions in the two strings differ. For a random pair of binary strings, the expected normalized Hamming distance is 0.5, so values near 0.5 suggest the strings share no more similarity than random chance would produce.
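The random-baseline claim is easy to check empirically; a quick Monte Carlo sketch over random 64-bit words:

```python
import random

random.seed(0)  # fixed seed for reproducibility
n, trials = 64, 2000
total = 0.0
for _ in range(trials):
    x = random.getrandbits(n)
    y = random.getrandbits(n)
    # XOR sets exactly the differing bit positions; popcount counts them
    total += bin(x ^ y).count("1") / n

print(total / trials)  # prints a value very close to 0.5
```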
How is Hamming distance applied in machine learning?
In machine learning, Hamming distance is used as a similarity metric for binary feature vectors in k-nearest-neighbor classification, locality-sensitive hashing for approximate nearest-neighbor search, and evaluating multi-label classifiers where each label is a binary variable. It is also used in genetic algorithms to measure the diversity of binary-encoded candidate solutions in a population.
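The multi-label evaluation use is usually called Hamming loss: the fraction of all label slots, across all samples, where prediction and ground truth disagree. A sketch on toy data (the label matrices below are made up):

```python
# Rows are samples, columns are binary labels
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]]

errors = sum(t != p for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p))
total = sum(len(row) for row in y_true)
print(round(errors / total, 3))  # -> 0.167 (2 wrong slots out of 12)
```

This is the mean per-sample normalized Hamming distance between the predicted and true label vectors, the same quantity scikit-learn reports as `hamming_loss`.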
Last updated: 2025-01-15 · Formula verified against primary sources.