
Information Theory: Introduction to Shannon Entropy

Overview

Information theory provides tools to quantify uncertainty and the information content of data. These tools form the backbone of the loss functions used to train neural networks: cross-entropy loss and KL divergence, both used extensively for this purpose, are derived from information-theoretic formulations. Separately, many models predict outcomes as categorical probability distributions, and Shannon entropy applied to these outputs characterizes how uncertain the model is about its prediction, which is important when evaluating models for risk-aware deployments. Finally, researchers at the cutting edge apply these tools during training to analyze its dynamics; we give a brief overview of these recent contributions.

Shannon Entropy Formula

For a categorical probability vector $P = [p_1, p_2, \dots, p_n]$ where $\sum p_i = 1$:

$$H(P) = -\sum_{i=1}^n p_i \log_2 p_i$$

We use base-2 logarithms to measure in bits. For $p_i = 0$, the term $p_i \log_2 p_i$ is defined as 0.
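
As a quick sanity check, the formula translates directly into a few lines of NumPy. This is a minimal sketch; the function name shannon_entropy is just for illustration.

import numpy as np

def shannon_entropy(p, base=2):
    """Shannon entropy of a categorical probability vector (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero entries; their terms are defined as 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([1.0, 0.0]))   # certain outcome: 0.0 bits
print(shannon_entropy([0.25] * 4))   # uniform over 4 outcomes: 2.0 bits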

What is the Shannon Entropy? A Binary Entropy Example

Let’s simplify to a binary distribution, aka a coin flip: probabilities $p$ and $1-p$. The entropy is:

$$H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$

This reaches a maximum of 1 bit at $p = 0.5$ (complete uncertainty) and 0 at the extremes (certainty). In simple terms: a fair coin has 1 bit of entropy. A weighted coin that always comes up heads reveals no additional information when its outcome is observed.

What is the Shannon Entropy? A Lottery Ticket Example

A coin flip may be low stakes (or not):

[Image: the coin-toss scene from No Country for Old Men]

We all might agree that predicting a winning lottery ticket would be tremendously valuable. Suppose a million unique lottery tickets are sold, and each has an equal probability of winning. If you know the winning ticket before the drawing occurs, how many more bits of information do you have?
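
One way to check the arithmetic: knowing the winner collapses a uniform distribution over $10^6$ outcomes, whose entropy is $\log_2 10^6$. A one-line sketch:

import math

n_tickets = 1_000_000
# Entropy of a uniform distribution over n outcomes is log2(n)
print(f"{math.log2(n_tickets):.2f} bits")  # about 19.93 bits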

Interactive Visualization

Use the slider below to adjust $p$ and see how the entropy changes. (If not interactive, clone the repo and run it locally.)

import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider
import math  # math.log2 for the two entropy terms

def plot_entropy(p=0.5):
    # The p = 0 and p = 1 terms are defined as 0, so the entropy is 0 at the extremes
    if p == 0 or p == 1:
        entropy = 0.0
    else:
        entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(['Entropy'], [entropy], color='skyblue')
    ax.set_ylim(0, 1.1)  # binary entropy never exceeds 1 bit
    ax.set_ylabel('Bits')
    ax.set_title(f'Shannon Entropy: {entropy:.2f} bits')
    plt.show()

# Slider sweeps p over the full [0, 1] range
interact(plot_entropy, p=FloatSlider(min=0.0, max=1.0, step=0.01, value=0.5));

Experiment: slide $p$ toward 0 or 1. What happens to the Shannon entropy?

Relevance to Deep Learning

Cross-entropy loss and KL divergence, the workhorse loss functions of deep learning, are both information-theoretic quantities. Minimizing the cross-entropy between the label distribution $P$ and the model's prediction $Q$ minimizes the amount of additional information you need to predict the outcome when using $Q$ in place of $P$: since $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$ and $H(P)$ is fixed by the data, this is the same as minimizing the KL divergence.
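
A minimal sketch of this relationship, using the identity $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$ (function names are illustrative, not from any particular library):

import numpy as np

def cross_entropy(p, q):
    """H(P, Q) in bits: expected code length for outcomes from P using a code built for Q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: the extra bits paid for coding with Q instead of the true P."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([1.0, 0.0, 0.0])  # one-hot "true" label
q = np.array([0.7, 0.2, 0.1])  # model's predicted distribution
print(cross_entropy(p, q))     # about 0.51 bits
print(kl_divergence(p, q))     # equal here, since H(P) = 0 for a one-hot P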

Advanced Applications: Kolmogorov-Sinai Entropy in Training Dynamics

Building on Shannon entropy, Kolmogorov-Sinai (KS) entropy asks a similar question. Consider a particle moving probabilistically in discrete time: there are many surrounding hypercubes it could end up in at time $t+1$. In this sense, what is the entropy of the particle?

We hope to show that:

  1. The loss landscape is what informs the prediction probabilities; therefore, what we’ve learned about second-order optimizers can be combined with this research area.
  2. Via the second law of thermodynamics, the amount of information learned during a single SGD batch is upper-bounded by the change in KS entropy; multiple batches have a component
  3. IBM research suggests the highest-noise directions on the loss landscape from batch to batch are also the
  4. The Muon optimizer has substantially improved training efficiency by descending under the spectral norm, reducing the dominance of the principal singular direction.
  5. These points suggest further research directions.

KS Entropy: From Shannon to Trajectories

KS entropy is similar to Shannon entropy but applied to time-evolving states. Consider space partitioned into hypercubes (cells); a trajectory is then a sequence of visited hypercubes. KS entropy asks: “Given the current hypercube, which one does the particle go to next?” It’s the supremum over partitions $\xi$ of the entropy rate:

$$h_{KS} = \sup_{\xi} \lim_{n \to \infty} \frac{1}{n} H(\xi^n)$$

where $H(\xi^n)$ is the Shannon entropy of the joint partition after $n$ steps, measuring how much new information is generated per step.
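
To make the limit concrete, here is a rough sketch (not the formal construction): for an i.i.d. binary source, the block-entropy rate $\frac{1}{n} H(\xi^n)$ estimated from data sits near the per-symbol Shannon entropy, which is the entropy rate of that source.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
p = 0.9  # a biased "coin" driving the symbolic dynamics
symbols = rng.choice([0, 1], size=200_000, p=[p, 1 - p])

def block_entropy_rate(seq, n):
    """(1/n) * Shannon entropy (bits) of the empirical distribution over length-n blocks."""
    blocks = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    counts = np.array(list(blocks.values()), dtype=float)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs)) / n

for n in [1, 2, 4, 8]:
    print(n, round(block_entropy_rate(symbols, n), 3))
# All values hover near H(0.9) ≈ 0.469 bits per step, the entropy rate of this i.i.d. source.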

Visualization: Flat Field vs. Sharp Well

To build intuition, compare two scenarios in a simplified 2D phase space (a grid of hypercubes): a flat field, in which the particle is equally likely to move into any of its 8 neighboring cells, and a sharp well, in which probability concentrates on a single downhill neighbor.

Use the interactive tool below: Toggle scenarios or adjust “sharpness” (bias strength). The bar shows approximate KS entropy (computed as Shannon entropy over next-step probabilities, approximating the rate for this partition).
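
If the widget does not render, the core computation looks roughly like this (a sketch: the sharpness parameter biases probability toward one "downhill" neighbor out of the 8 surrounding cells):

import numpy as np

def next_step_entropy(sharpness=0.0):
    """Shannon entropy (bits) over the 8 neighboring cells the particle can move to next."""
    scores = np.zeros(8)
    scores[0] = sharpness  # bias one neighbor; sharpness = 0 leaves the field flat
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the 8 cells
    return -np.sum(probs * np.log2(probs))

print(next_step_entropy(0.0))   # flat field: 3.0 bits, i.e. log2(8)
print(next_step_entropy(10.0))  # sharp well: close to 0 bits (nearly deterministic)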

Observe: in the flat field, uniform probabilities yield high entropy ($\approx 3$ bits, i.e. $\log_2 8$). In the sharp well, increasing sharpness concentrates probability, reducing the entropy toward 0 (a deterministic path).

Relating KS Entropy to Training Dynamics

In deep learning optimization, parameters evolve as a trajectory in a high-dimensional space. KS entropy captures the “chaos” of this process: high KS entropy means nearby trajectories separate quickly and each optimization step generates new information about where the parameters will end up, while low KS entropy means the path is nearly deterministic.

For deeper dives, see work on KS entropy in neural activity: [paper placeholder; replace with a relevant reference].

Exercises for Thesis Sharpening

  1. Simulate SGD on a toy loss surface and compute an approximate KS entropy from the parameter trajectories (see the sketch below this list).
  2. Compare $h_{KS}$ under Adam vs. SGD: does adaptivity reduce chaos?
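
A possible starting point for Exercise 1, as a rough sketch rather than a rigorous KS estimator: run noisy SGD on a quadratic bowl, coarse-grain the parameter trajectory onto a grid of cells, and measure the average entropy of the empirical next-cell transitions.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def sgd_trajectory(steps=20_000, lr=0.1, noise=0.5):
    """Noisy gradient descent on the toy loss L(w) = 0.5 * ||w||^2."""
    w = np.array([2.0, -2.0])
    traj = []
    for _ in range(steps):
        grad = w + noise * rng.normal(size=2)  # true gradient plus minibatch-style noise
        w = w - lr * grad
        traj.append(w.copy())
    return np.array(traj)

def transition_entropy(traj, cell=0.25):
    """Average Shannon entropy (bits) of the next-cell distribution, given the current cell."""
    cells = [tuple(c) for c in np.floor(traj / cell).astype(int)]
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(cells[:-1], cells[1:]):
        counts[a][b] += 1
    weighted, total = 0.0, 0
    for nxt in counts.values():
        n = sum(nxt.values())
        probs = np.array(list(nxt.values()), dtype=float) / n
        weighted += n * (-np.sum(probs * np.log2(probs)))
        total += n
    return weighted / total

for noise in [0.0, 0.2, 1.0]:
    print(noise, round(transition_entropy(sgd_trajectory(noise=noise)), 3))
# More gradient noise -> higher next-cell entropy, a rough proxy for KS entropy at this partition.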