Kolmogorov Complexity
The information content of an individual string: the length of its shortest description. Uncomputable, yet the foundation of algorithmic information theory.
10.1 Definition
Fix a universal Turing machine \(U\). The Kolmogorov complexity (also called algorithmic or descriptional complexity) of a binary string \(x\) is:
\[ K(x) = \min\bigl\{|p| : U(p) = x\bigr\} \]
length of shortest binary program that makes \(U\) output \(x\)
Intuitively: how many bits are needed to describe \(x\) completely? A string like \(\underbrace{00\cdots0}_{1000}\) can be described by a short program ("print 1000 zeros"), so \(K(x) \ll |x|\). A genuinely random string has no shorter description than itself, so \(K(x) \approx |x|\).
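The "print 1000 zeros" idea can be made concrete by treating a Python source expression as the description, a toy illustration rather than a real universal machine:

```python
# The string of 1000 zeros has a description far shorter than itself.
x = "0" * 1000

# A Python expression that evaluates to x serves as the "program";
# its source length stands in for the program length.
description = '"0" * 1000'
assert eval(description) == x

print(len(description), len(x))  # -> 10 1000: the description is 100x shorter
```

The fixed Python interpreter plays the role of the reference machine \(U\); its size is the additive constant.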
Prefix-free version: One often uses the prefix-free Kolmogorov complexity \(K^*(x)\), where only self-delimiting programs are allowed (analogous to prefix-free codes). This satisfies the Kraft inequality \(\sum_x 2^{-K^*(x)} \le 1\), making it closer to a probability measure.
10.2 Invariance Theorem
Does \(K(x)\) depend on the choice of reference machine \(U\)? The Invariance Theorem says it does not, up to an additive constant:
\[ K_U(x) \le K_V(x) + c_{U,V} \]
where \(c_{U,V}\) is a constant depending only on the two machines, not on \(x\)
Proof: Given a program \(p_V\) that computes \(x\) on \(V\), we can prepend a fixed interpreter \(I_{V\to U}\) that simulates \(V\) on \(U\): the combined program \(I_{V\to U}\,p_V\) runs on \(U\) and has length \(|I_{V\to U}| + |p_V|\), where \(|I_{V\to U}|\) is the fixed constant \(c_{U,V}\).
The invariance theorem justifies speaking of the Kolmogorov complexity: it is well-defined up to a fixed additive constant. For long strings, this constant is negligible.
10.3 Uncomputability
Theorem: There is no algorithm that computes \(K(x)\) for all \(x\).
Proof (Berry paradox variant): Suppose \(K\) were computable. For each \(n\), consider the string:

"the lexicographically first string \(x\) with \(K(x) > n\)"

Since \(K\) is assumed computable, this description can be turned into a program of length \(O(\log n)\): encode \(n\), then search strings in lexicographic order until one with \(K(x) > n\) is found. But by definition the string that program outputs has \(K > n\), while the program itself has length \(O(\log n) < n\) for large enough \(n\), a contradiction. Hence \(K\) is uncomputable.
Practically: we can approximate \(K(x)\) from above with compression algorithms (zlib, bz2, lzma). These give upper bounds: if a compressor produces \(k\) bits of output, then \(K(x) \le k + c\), where \(c\) accounts for the fixed size of the decompressor.
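This upper-bound estimate is easy to compute with the standard-library compressors; a minimal sketch, taking the best of the three:

```python
import bz2
import lzma
import os
import zlib

def k_upper_bound(data: bytes) -> int:
    """Smallest compressed size (in bytes) over three compressors:
    an upper bound on K(data), up to the fixed decompressor size."""
    return min(len(zlib.compress(data, 9)),
               len(bz2.compress(data, 9)),
               len(lzma.compress(data)))

print(k_upper_bound(b"0" * 1000))       # tiny: highly compressible
print(k_upper_bound(os.urandom(1000)))  # near (or above) 1000: incompressible
```

Note the asymmetry: a small compressed size proves \(K(x)\) is small, but a large one proves nothing, since no compressor finds all regularities.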
10.4 Relation to Shannon Entropy
For a random variable \(X\) drawn i.i.d. with distribution \(P\):
\[ \mathbb{E}[K(X)] = H(X) + O(1) \]
More precisely, for a sequence of \(n\) i.i.d. samples:
\[ \frac{1}{n}\mathbb{E}[K(X^n)] \xrightarrow{n\to\infty} H(X) \]
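This convergence can be observed empirically, with zlib standing in as a (loose) upper bound on the optimal description length; a sketch, assuming i.i.d. Bernoulli(\(p\)) bits:

```python
import math
import random
import zlib

def H(p: float) -> float:
    """Binary Shannon entropy in bits per symbol."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate(p: float, n: int = 200_000, seed: int = 0) -> float:
    """zlib output size, in bits per symbol, for n i.i.d. Bernoulli(p) bits."""
    rng = random.Random(seed)
    s = "".join("1" if rng.random() < p else "0" for _ in range(n))
    return 8 * len(zlib.compress(s.encode(), 9)) / n

for p in (0.5, 0.1, 0.01):
    print(f"p={p}: H={H(p):.3f}  zlib rate={rate(p):.3f} bits/symbol")
```

The compressed rate tracks \(H(p)\) in the right order (more bias, fewer bits) but stays above it, as any real compressor must.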
Shannon Entropy
- Defined for probability distributions
- Average over many outcomes
- Computable
- Distribution-dependent

Kolmogorov Complexity
- Defined for individual strings
- Complexity of a single object
- Uncomputable (only approximable)
- Distribution-free / objective
10.5 Random Strings
A string \(x\) of length \(n\) is called algorithmically random (or Kolmogorov random) if \(K(x) \ge n - O(1)\): there is no significantly shorter description.
Key facts:
- At least \(2^n - 2^{n-c+1}\) strings of length \(n\) have \(K(x) \ge n - c\): almost all strings are random.
- Random strings have no exploitable pattern: they pass all effective statistical tests. For infinite sequences, Martin-Löf randomness coincides with having incompressible prefixes (Schnorr's theorem, using prefix-free complexity).
- No fixed formal system can prove "\(K(x) > c\)" for more than finitely many specific strings \(x\), yet almost all strings are random. This is analogous to Gödel incompleteness.
Python: Estimating Complexity via Compression
Using zlib, bz2, and lzma as upper-bound estimators for \(K(x)\), the experiment compares eight string types (zeros, alternating bits, the Fibonacci word, powers of 2, English text, source code, random bits, random DNA), each of length 1000: compressed sizes, compression ratios, the growth of the estimate with length for random versus structured strings, and the distribution of the estimate over 500 samples of each type.
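A minimal text-only sketch of the estimator, covering a subset of the string types described above (plots omitted):

```python
import bz2
import lzma
import random
import zlib

def k_est(data: bytes) -> int:
    """Upper-bound estimate of K: best compressed size of three compressors."""
    return min(len(zlib.compress(data, 9)),
               len(bz2.compress(data, 9)),
               len(lzma.compress(data)))

def fibonacci_word(n: int) -> str:
    """First n symbols of the Fibonacci word: s_{k+1} = s_k s_{k-1}."""
    a, b = "0", "01"
    while len(b) < n:
        a, b = b, b + a
    return b[:n]

N = 1000
rng = random.Random(0)
samples = {
    "zeros":       "0" * N,
    "alternating": "01" * (N // 2),
    "fibonacci":   fibonacci_word(N),
    "random bits": "".join(rng.choice("01") for _ in range(N)),
    "random DNA":  "".join(rng.choice("ACGT") for _ in range(N)),
}
for name, s in samples.items():
    print(f"{name:12s} len={len(s):4d}  K_est={k_est(s.encode()):4d} bytes")
```

Structured strings (zeros, alternating, Fibonacci) should compress to a small fraction of their length, while the random strings should stay near the entropy of their alphabet, roughly 1 bit per symbol for bits and 2 for DNA.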