I've been experimenting with piping raw git diff output into LLMs for automated security review, and I wanted to share what I've learned because some of the results surprised me.
The problem that started this
A teammate refactored a SQL query from string concatenation to an f-string. The diff looked like an improvement:
- query = "SELECT * FROM users WHERE id = " + user_id
+ query = f"SELECT * FROM users WHERE id = {user_id}"
Three reviewers approved it. It looked cleaner. But the vulnerability was identical — both are injection vectors. The cosmetic improvement actually made it harder to catch because it looked like the dev was modernizing the code.
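For contrast, here's what the real fix looks like: a parameterized query binds the user input as data, never as SQL. (The table and injection payload below are hypothetical, just to make the behavior concrete.)

```python
import sqlite3

# Hypothetical schema and injection payload, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_id = "1 OR 1=1"  # classic injection payload

# The ? placeholder binds the payload as a value, not as SQL,
# so the "OR 1=1" trick has no effect.
rows = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()
print(rows)  # → []  (the payload matches nothing)
```

With either the concatenated or the f-string version, the same payload would rewrite the query and return every row.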
This is the kind of thing that made me think: can an LLM reliably detect that a diff looks like a fix but isn't?
The architecture
I built a FastAPI service that accepts a raw diff string and returns structured JSON with severity-classified security and quality issues. Here are the key design decisions I made and why.
Why diffs and not full files?
Full-file analysis is what SAST tools already do well. Diffs are interesting because they capture intent — what the developer was trying to change. An LLM can reason about the gap between what a change appears to do and what it actually does. That's the niche.
Why structured JSON output instead of free text?
Because the output needs to be machine-consumable. If you want to post a PR comment, block a merge, or feed results into a dashboard, you need parseable severity levels and categories — not a paragraph of prose.
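To make that concrete, here's a sketch of loading the response into typed records so a malformed reply fails fast instead of deep inside a pipeline. The field names mirror the example response shown below; the wrapper itself is my own convenience code, not part of the API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Issue:
    severity: str
    description: str
    line_hint: Optional[str] = None

def parse_review(payload: dict) -> tuple:
    """Split a /review response into (security, quality) issue lists.

    Raises immediately (KeyError/TypeError) on unexpected shapes.
    """
    security = [Issue(**i) for i in payload["security_issues"]]
    quality = [Issue(**i) for i in payload["quality_issues"]]
    return security, quality

# A response in the shape the API returns:
payload = {
    "security_issues": [
        {"severity": "CRITICAL",
         "description": "SQL injection via f-string",
         "line_hint": 'f"SELECT * FROM users WHERE id = {user_id}"'},
    ],
    "quality_issues": [
        {"severity": "MEDIUM", "description": "Cosmetic change only"},
    ],
}
security, quality = parse_review(payload)
print(security[0].severity)  # → CRITICAL
```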
The API has three endpoints:
- `POST /review` returns security and quality issues with severity levels
- `POST /commit-message` generates a Conventional Commits message from the diff
- `POST /changelog` generates a Keep a Changelog formatted entry
Here's what the /review endpoint returns for the SQL injection example above:
```json
{
  "security_issues": [
    {
      "severity": "CRITICAL",
      "description": "SQL query built with f-string interpolation is vulnerable to SQL injection. Use parameterized queries.",
      "line_hint": "f\"SELECT * FROM users WHERE id = {user_id}\""
    }
  ],
  "quality_issues": [
    {
      "severity": "MEDIUM",
      "description": "Refactored from concatenation to f-string but the core vulnerability remains. This is a cosmetic change, not a security fix."
    }
  ],
  "summary": "Critical: SQL injection persists after refactor. The change is cosmetic."
}
```
Each issue has a severity (CRITICAL, HIGH, MEDIUM, LOW, INFO), a description, and a line hint pointing to the relevant code. The separation between security and quality issues lets you handle them differently in your pipeline — maybe security issues block the merge, but quality issues are just warnings.
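As a sketch of that split in practice (the blocking threshold here is my own policy choice, not something the API prescribes):

```python
# Hypothetical CI policy: high-severity security findings block the merge;
# everything else is reported but allowed through.
BLOCKING = {"CRITICAL", "HIGH"}

def should_block(review: dict) -> bool:
    """Return True if any security issue meets the blocking threshold."""
    return any(issue["severity"] in BLOCKING
               for issue in review.get("security_issues", []))

review = {
    "security_issues": [{"severity": "CRITICAL", "description": "SQL injection"}],
    "quality_issues": [{"severity": "MEDIUM", "description": "Cosmetic change"}],
}
print(should_block(review))  # → True
```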
Why DeepSeek instead of GPT-4?
Cost. DeepSeek runs at roughly $0.0004 per request, which makes it viable to offer a free tier that's actually usable. I haven't done a rigorous side-by-side comparison, but anecdotally GPT-4 catches more subtle logic bugs while DeepSeek handles pattern-based security issues (injection, XSS, hardcoded secrets) well enough for a first-pass filter.
Why truncate diffs to 8,000 characters?
Two reasons: (1) sending very large diffs produces worse analysis — the model loses focus when there's too much context, and (2) if your diff is that large, an automated tool probably isn't the right first step. The truncation is automatic — the API takes whatever you send and clips it if needed.
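If you'd rather decide what gets dropped instead of relying on server-side clipping, one option is to truncate at file boundaries so the model never sees a half-cut hunk. A minimal sketch (the 8,000-character budget matches the API's limit; the splitting logic is my own):

```python
def truncate_diff(diff: str, budget: int = 8000) -> str:
    """Keep whole per-file sections of a unified diff until the budget runs out."""
    sections, current = [], []
    for line in diff.splitlines(keepends=True):
        # "diff --git" marks the start of each file's section.
        if line.startswith("diff --git") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    kept, used = [], 0
    for section in sections:
        if used + len(section) > budget:
            break  # drop this file and everything after it
        kept.append(section)
        used += len(section)
    return "".join(kept)
```

Dropping whole files is lossy, but it's a predictable kind of lossy: the model either sees a complete change to a file or nothing of it.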
What it handles well
Based on my testing so far, it reliably flags:
- SQL injection (concatenation, f-strings, format strings)
- XSS via unsanitized DOM insertion (`innerHTML`, template literals)
- Hardcoded secrets and API keys in source code
- Command injection (`shell=True`, `os.system`)
The most interesting behavior is what I call the "cosmetic refactor" pattern — when a diff looks like it's fixing something but the underlying vulnerability is unchanged. The LLM seems to handle this well because it's reasoning about what changed rather than just scanning for dangerous patterns.
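Another hypothetical diff in the same family: a move from `os.system` to `subprocess` that reads like a security cleanup, but `shell=True` keeps the command injection alive if `host` is attacker-controlled.

```diff
- os.system("ping -c 1 " + host)
+ subprocess.run(f"ping -c 1 {host}", shell=True)
```

The actual fix would be `subprocess.run(["ping", "-c", "1", host])`: an argument list, no shell.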
What it doesn't catch (and probably can't)
Being honest about the limitations:
- Business logic flaws — the model has no context about what your app is supposed to do
- Race conditions — diffs don't contain concurrency context
- Dependency vulnerabilities — it only sees your code changes, not your supply chain
- Subtle type confusion bugs — LLMs aren't compilers
- Anything requiring full codebase context — the model only sees the diff, not the file it belongs to
This means it won't catch a vulnerability that's introduced by the interaction between the changed code and existing code it can't see.
Where it fits in a pipeline
This is a pre-review filter, not a SAST replacement. Think of it as a layer that catches the obvious stuff before a human reviewer spends their time on it.
Some ways you could use it:
- Pre-commit hook: run `git diff --cached` through `/review` before allowing a commit
- CI/CD gate: call `/review` in a GitHub Action and post results as a PR comment
- Changelog automation: call `/changelog` on merge to auto-update CHANGELOG.md
- Commit message generation: pipe your staged diff through `/commit-message` to stop writing "fix stuff"
Here's what a basic integration looks like:
```bash
DIFF=$(git diff HEAD~1)
# Interpolating $DIFF directly into a JSON string breaks on quotes and
# newlines, so let jq do the escaping and pipe the payload to curl.
jq -n --arg diff "$DIFF" '{diff: $diff}' |
  curl -X POST https://diffsense.p.rapidapi.com/review \
    -H "Content-Type: application/json" \
    -H "X-RapidAPI-Key: YOUR_KEY" \
    -d @-
```
Or in Python:
```python
import requests

with open("my.patch") as f:
    diff = f.read()

response = requests.post(
    "https://diffsense.p.rapidapi.com/review",
    headers={
        "Content-Type": "application/json",
        "X-RapidAPI-Key": "YOUR_KEY",
    },
    json={"diff": diff},
)
print(response.json())
```
The bigger insight
The most valuable thing I've learned building this: LLMs are better at analyzing changes than analyzing code. A diff is a narrative — "someone tried to do X" — and LLMs are good at narratives. Static code is just structure, and traditional tools handle structure better.
That's why I think diff-level analysis is a genuine niche, not just "yet another AI code review tool." It's not competing with SAST — it's covering a different surface.
Try it / read the code
The implementation is open source if you want to look at the FastAPI structure or the prompt engineering: github.com/diffsense/diffsense-api
There's also a hosted version on RapidAPI with a free tier (50 requests/month): rapidapi.com/terrycrews99/api/diffsense
I'm actively developing this and would genuinely appreciate feedback — what patterns would you want caught at the diff level? What would make this useful enough to add to your pipeline?