／var／log marcus chiu

LLM Interpretability

Created on Aug 28, 2025 · Last Modified on Sep 02, 2025

LLM Interpretability

is the field focused on understanding the internal workings of large language models (LLMs) to reveal their reasoning processes, decision-making mechanisms, and the concepts they use to generate responses
uses techniques like mechanistic interpretability to map features to internal components, aiming to improve the safety, fairness, alignment, and reliability of LLMs by turning them from “black boxes” into more transparent systems

Tools

Sparse Autoencoders (SAE)
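
A minimal sketch of the idea behind a sparse autoencoder, assuming a common setup: a ReLU encoder, a linear decoder, and an L1 penalty on the feature activations. The function names, sizes, and random weights are illustrative assumptions, not taken from any particular library. A real SAE is trained on activations captured from an LLM, not on random vectors.

```python
import random

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One forward pass of a toy sparse autoencoder on a single activation vector."""
    # Encoder: subtract the decoder bias, project to the wider feature space, apply ReLU.
    pre = matvec(W_enc, [a - b for a, b in zip(x, b_dec)])
    features = [max(0.0, p + b) for p, b in zip(pre, b_enc)]
    # Decoder: rebuild the activation as a weighted sum of feature directions.
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon_error = sum((a - r) ** 2 for a, r in zip(x, recon))
    l1 = sum(abs(f) for f in features)
    return recon_error + l1_coeff * l1

rng = random.Random(0)
d_model, d_features = 8, 32  # hypothetical sizes; the feature dictionary is wider than the input

# Stand-in for a real LLM activation vector (assumption: random data for illustration).
x = [rng.gauss(0, 1) for _ in range(d_model)]

# Untrained random weights: W_enc is (d_features x d_model), W_dec is (d_model x d_features).
W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_features)]
W_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)] for _ in range(d_model)]
b_enc = [0.0] * d_features
b_dec = [0.0] * d_model

features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
active = sum(1 for f in features if f > 0)
print(f"loss={loss:.4f}, active features={active}/{d_features}")
```

Training would adjust the weights to lower this loss. The L1 term pushes most feature activations to zero, so each input is explained by a few active features, which is what makes them candidates for interpretation.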