why most deep learning practitioners misunderstand model interpretability

hot take: interpretability in machine learning, at least as it’s usually pursued, is overrated. people obsess over explainability tools like SHAP and LIME that look flashy but rarely offer actionable insights. i’m going to argue that what most ML practitioners call “interpretability” is just confirmation bias dressed up in fancy visuals. what’s worse, many interpretability techniques are actually harming models rather than improving them.

interpretability isn’t a panacea

there’s this myth that we need interpretability to trust our models. but if we’re honest, most interpretability tools tell us what we already know or want to believe. the real problem is that we’re often trying to make inherently uninterpretable models (like deep neural networks) fit into this paradigm of easy-to-digest insights.

let’s get technical here. imagine you’re using SHAP values to explain a model’s predictions. SHAP is a great tool in certain contexts, but in complex neural networks it often produces “importance” scores that mislead more than they explain.

an example of SHAP values not matching reality

suppose we have a simple MLP on a regression task, and we want to interpret the effect of each input feature. let’s run a toy example and see how well SHAP values hold up.

 
import torch
import torch.nn as nn
import shap
import numpy as np

# define a simple model
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# generate some synthetic data
np.random.seed(42)
X = np.random.randn(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] - 1.5 * X[:, 2] + np.random.normal(0, 0.1, 100)

# convert to tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# train the model
model = SimpleMLP(input_size=5, hidden_size=10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

for epoch in range(500):
    optimizer.zero_grad()
    output = model(X_tensor).squeeze()
    loss = criterion(output, y_tensor)
    loss.backward()
    optimizer.step()

# shap explainer, using the training data as the background distribution
explainer = shap.DeepExplainer(model, X_tensor)
shap_values = explainer.shap_values(X_tensor)

# plot shap values; depending on the shap version, shap_values may be a list
# or carry a trailing output dimension, so normalize it first
shap.summary_plot(np.squeeze(np.array(shap_values)), X)

the plot above will give you a per-feature importance value for each input. on this toy problem the target is literally a linear combination of the features, so the attributions will roughly track the true coefficients and everything looks tidy. but here’s the kicker: SHAP assumes additivity, decomposing every prediction into a baseline plus a sum of per-feature contributions, and that isn’t how deep neural networks actually operate. it forces a locally linear, additive summary onto a nonlinear model, which might look like it’s giving meaningful explanations but is really just bending the model’s behavior into a rigid, oversimplified form.
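
you can see the additivity constraint directly: SHAP decomposes each prediction into a baseline plus per-feature contributions that are forced to sum back (approximately) to the model output. here’s a quick check, assuming the model, explainer, and tensors from the snippet above are still in scope; the exact shapes of shap_values and expected_value vary across shap versions, hence the normalization:

# normalize shap_values to shape (n_samples, n_features); depending on the
# shap version it may be a list or carry a trailing output dimension
sv = np.squeeze(np.array(shap_values))

with torch.no_grad():
    preds = model(X_tensor).squeeze().numpy()

# additivity: baseline + sum of per-feature attributions should roughly
# reconstruct each prediction
base = float(np.ravel(explainer.expected_value)[0])
print("max additivity gap:", np.max(np.abs(base + sv.sum(axis=1) - preds)))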

in this case, SHAP tells us how “important” each feature is in isolation, but that’s hardly useful when features interact in complex ways. what if the target depends on a nonlinear interaction between X[:, 0] and X[:, 1]? per-feature attributions won’t capture it adequately.
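
to make that concrete, here’s a sketch of the same setup with the target rebuilt around an explicit interaction term. it reuses SimpleMLP, criterion, and the data from the snippet above, and the coefficients are arbitrary; treat it as an illustration, not a rigorous experiment:

# same features, but the target is now dominated by an interaction
y_int = 4 * X[:, 0] * X[:, 1] + np.random.normal(0, 0.1, 100)
y_int_tensor = torch.tensor(y_int, dtype=torch.float32)

# train the same architecture on the interaction-heavy target
model_int = SimpleMLP(input_size=5, hidden_size=10)
optimizer_int = torch.optim.Adam(model_int.parameters(), lr=0.01)
for epoch in range(500):
    optimizer_int.zero_grad()
    loss = criterion(model_int(X_tensor).squeeze(), y_int_tensor)
    loss.backward()
    optimizer_int.step()

# per-feature attributions will flag features 0 and 1 as important, but they
# cannot express that the effect only exists jointly; the interaction gets
# smeared across the two columns
explainer_int = shap.DeepExplainer(model_int, X_tensor)
shap_values_int = explainer_int.shap_values(X_tensor)
shap.summary_plot(np.squeeze(np.array(shap_values_int)), X)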

interpretability vs. robustness

instead of interpretability, practitioners should prioritize robustness. we need models that perform consistently under distributional shift and adversarial conditions. ironically, the time and compute spent on SHAP or LIME could be better spent testing models under the conditions where they’re likely to fail.

what robustness testing actually looks like

robustness is simple to test if you have a decent pipeline. here’s a quick example:

 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# split the original data into train and test sets (note: the toy model above
# was trained on all of X, so this split is only for illustration; in a real
# pipeline you would hold the test set out before training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# introduce a slight shift in the test data
X_test_shifted = X_test + np.random.normal(0, 0.2, X_test.shape)

# compare model performance on original and shifted data
model.eval()
with torch.no_grad():
    y_pred = model(torch.tensor(X_test, dtype=torch.float32)).squeeze().numpy()
    y_pred_shifted = model(torch.tensor(X_test_shifted, dtype=torch.float32)).squeeze().numpy()

print("MSE on original test data:", mean_squared_error(y_test, y_pred))
print("MSE on shifted test data:", mean_squared_error(y_test, y_pred_shifted))

this is what robustness testing actually looks like: intentionally perturbing your data and measuring how well the model holds up. instead of post-hoc interpretability tricks, robustness testing gives you a true sense of how your model will behave when real-world data gets noisy or shifts.
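
if you want something slightly more systematic than a single perturbation, sweep the noise scale and see where the error curve starts to bend. a minimal sketch continuing from the variables above (the noise levels here are arbitrary):

# sweep increasing amounts of input noise and watch how the error degrades
model.eval()
for sigma in [0.0, 0.1, 0.2, 0.5, 1.0]:
    X_noisy = X_test + np.random.normal(0, sigma, X_test.shape)
    with torch.no_grad():
        preds = model(torch.tensor(X_noisy, dtype=torch.float32)).squeeze().numpy()
    print(f"noise sigma={sigma:.1f}  MSE={mean_squared_error(y_test, preds):.4f}")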

interpretability at the frontier: lessons from scaling large models

interpretability becomes much more critical as models scale. recent work from Anthropic, for instance, has shown how sparse autoencoders can extract meaningful, interpretable features from massive language models that were previously black boxes. they’re uncovering complex latent structure in LLMs, structure that suggests there are actual, learnable “features” of language hidden within their layers.
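
the core idea is simple enough to sketch: train an overcomplete autoencoder on a model’s internal activations with an L1 penalty so that only a handful of latent units fire for any given input, then treat those units as candidate features. here’s a generic sparse-autoencoder sketch in PyTorch; the dimensions, penalty weight, and random stand-in activations are made up for illustration and aren’t Anthropic’s actual setup:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # maps d_model-dimensional activations into a larger, sparse feature space
    def __init__(self, d_model, d_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features

# hypothetical sizes: 512-dim activations, 4096 candidate features
sae = SparseAutoencoder(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3  # sparsity pressure; made-up value

acts = torch.randn(256, 512)  # stand-in for a batch of captured activations
for step in range(100):
    opt.zero_grad()
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_weight * features.abs().mean()
    loss.backward()
    opt.step()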

while this line of research is super useful, Anthropic themselves don’t seem that worried. their original charter was to not push the frontier of capabilities, yet with Sonnet 3.5 they did exactly that. in my opinion, this is a victory for OpenAI’s “iterative deployment” strategy: LLMs are a lot less dangerous than we think, and we eliminate the remaining dangers by letting people use the models.

a final hot take

here’s the uncomfortable truth: interpretability, in the way it’s usually applied, is more about appeasing stakeholders than making better models. the majority of these tools provide just enough insight to reassure someone who doesn’t understand the nuances of the model anyway. if you want better models, focus on robustness, regularization, and calibration over interpretability.

in an ideal world, “interpretability” would mean making the model itself inherently more intuitive, through simpler architectures or ones with built-in modularity. but in the meantime, take the interpretability obsession with a grain of salt and put your effort into making sure your model is as solid and generalizable as possible.