I've been experimenting with piping raw git diff output into LLMs for automated security review, and I wanted to share what I've learned because some of the results surprised me.
The problem that started this
A teammate refactored a SQL query from string concatenation to an f-string. The diff looked like an improvement:
- query = "SELECT * FROM users WHERE id = " + user_id
+ query = f"SELECT * FROM users WHERE id = {user_id}"
Three reviewers approved it. It looked cleaner. But the vulnerability was identical — both are injection vectors. The cosmetic improvement actually made it harder to catch because it looked like the dev was modernizing the code.
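For contrast, here's what the real fix looks like: a parameterized query binds the user input as data, never as SQL. (The table and injection payload below are hypothetical, just to make the behavior concrete.)

```python
import sqlite3

# Hypothetical schema and injection payload, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_id = "1 OR 1=1"  # classic injection payload

# The ? placeholder binds the payload as a value, not as SQL,
# so the "OR 1=1" trick has no effect.
rows = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()
print(rows)  # → []  (the payload matches nothing)
```

With either the concatenated or the f-string version, the same payload would rewrite the query and return every row.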
This is the kind of thing that made me think: can an LLM reliably detect that a diff looks like a fix but isn't?
The architecture
I built a FastAPI service that accepts a raw diff string and returns structured JSON with severity-classified security and quality issues. Here are the key design decisions I made and why.
Why diffs and not full files?
Full-file analysis is what SAST tools already do well. Diffs are interesting because they capture intent — what the developer was trying to change. An LLM can reason about the gap between what a change appears to do and what it actually does. That's the niche.
Why structured JSON output instead of free text?
Because the output needs to be machine-consumable. If you want to post a PR comment, block a merge, or feed results into a dashboard, you need parseable severity levels and categories — not a paragraph of prose.
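To make that concrete, here's a sketch of loading the response into typed records so a malformed reply fails fast instead of deep inside a pipeline. The field names mirror the example response shown below; the wrapper itself is my own convenience code, not part of the API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Issue:
    severity: str
    description: str
    line_hint: Optional[str] = None

def parse_review(payload: dict) -> tuple:
    """Split a /review response into (security, quality) issue lists.

    Raises immediately (KeyError/TypeError) on unexpected shapes.
    """
    security = [Issue(**i) for i in payload["security_issues"]]
    quality = [Issue(**i) for i in payload["quality_issues"]]
    return security, quality

# A response in the shape the API returns:
payload = {
    "security_issues": [
        {"severity": "CRITICAL",
         "description": "SQL injection via f-string",
         "line_hint": 'f"SELECT * FROM users WHERE id = {user_id}"'},
    ],
    "quality_issues": [
        {"severity": "MEDIUM", "description": "Cosmetic change only"},
    ],
}
security, quality = parse_review(payload)
print(security[0].severity)  # → CRITICAL
```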
The API has three endpoints:
- `POST /review` returns security and quality issues with severity levels
- `POST /commit-message` generates a Conventional Commits message from the diff
- `POST /changelog` generates a Keep a Changelog formatted entry
Here's what the /review endpoint returns for the SQL injection example above:
```json
{
  "security_issues": [
    {
      "severity": "CRITICAL",
      "description": "SQL query built with f-string interpolation is vulnerable to SQL injection. Use parameterized queries.",
      "line_hint": "f\"SELECT * FROM users WHERE id = {user_id}\""
    }
  ],
  "quality_issues": [
    {
      "severity": "MEDIUM",
      "description": "Refactored from concatenation to f-string but the core vulnerability remains. This is a cosmetic change, not a security fix."
    }
  ],
  "summary": "Critical: SQL injection persists after refactor. The change is cosmetic."
}
```
Each issue has a severity (CRITICAL, HIGH, MEDIUM, LOW, INFO), a description, and a line hint pointing to the relevant code. The separation between security and quality issues lets you handle them differently in your pipeline — maybe security issues block the merge, but quality issues are just warnings.
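As a sketch of that split in practice (the blocking threshold here is my own policy choice, not something the API prescribes):

```python
# Hypothetical CI policy: high-severity security findings block the merge;
# everything else is reported but allowed through.
BLOCKING = {"CRITICAL", "HIGH"}

def should_block(review: dict) -> bool:
    """Return True if any security issue meets the blocking threshold."""
    return any(issue["severity"] in BLOCKING
               for issue in review.get("security_issues", []))

review = {
    "security_issues": [{"severity": "CRITICAL", "description": "SQL injection"}],
    "quality_issues": [{"severity": "MEDIUM", "description": "Cosmetic change"}],
}
print(should_block(review))  # → True
```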
Why DeepSeek instead of GPT-4?
Cost. DeepSeek runs at roughly $0.0004 per request, which makes it viable to offer a free tier that's actually usable. I haven't done a rigorous side-by-side comparison, but anecdotally GPT-4 catches more subtle logic bugs while DeepSeek handles pattern-based security issues (injection, XSS, hardcoded secrets) well enough for a first-pass filter.
Why truncate diffs to 8,000 characters?
Two reasons: (1) sending very large diffs produces worse analysis — the model loses focus when there's too much context, and (2) if your diff is that large, an automated tool probably isn't the right first step. The truncation is automatic — the API takes whatever you send and clips it if needed.
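If you'd rather decide what gets dropped instead of relying on server-side clipping, one option is to truncate at file boundaries so the model never sees a half-cut hunk. A minimal sketch (the 8,000-character budget matches the API's limit; the splitting logic is my own):

```python
def truncate_diff(diff: str, budget: int = 8000) -> str:
    """Keep whole per-file sections of a unified diff until the budget runs out."""
    sections, current = [], []
    for line in diff.splitlines(keepends=True):
        # "diff --git" marks the start of each file's section.
        if line.startswith("diff --git") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    kept, used = [], 0
    for section in sections:
        if used + len(section) > budget:
            break  # drop this file and everything after it
        kept.append(section)
        used += len(section)
    return "".join(kept)
```

Dropping whole files is lossy, but it's a predictable kind of lossy: the model either sees a complete change to a file or nothing of it.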
What it handles well
Based on my testing so far, it reliably flags:
- SQL injection (concatenation, f-strings, format strings)
- XSS via unsanitized DOM insertion (`innerHTML`, template literals)
- Hardcoded secrets and API keys in source code
- Command injection (`shell=True`, `os.system`)
The most interesting behavior is what I call the "cosmetic refactor" pattern — when a diff looks like it's fixing something but the underlying vulnerability is unchanged. The LLM seems to handle this well because it's reasoning about what changed rather than just scanning for dangerous patterns.
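Another hypothetical diff in the same family: a move from `os.system` to `subprocess` that reads like a security cleanup, but `shell=True` keeps the command injection alive if `host` is attacker-controlled.

```diff
- os.system("ping -c 1 " + host)
+ subprocess.run(f"ping -c 1 {host}", shell=True)
```

The actual fix would be `subprocess.run(["ping", "-c", "1", host])`: an argument list, no shell.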
What it doesn't catch (and probably can't)
Being honest about the limitations:
- Business logic flaws — the model has no context about what your app is supposed to do
- Race conditions — diffs don't contain concurrency context
- Dependency vulnerabilities — it only sees your code changes, not your supply chain
- Subtle type confusion bugs — LLMs aren't compilers
- Anything requiring full codebase context — the model only sees the diff, not the file it belongs to
This means it won't catch a vulnerability that's introduced by the interaction between the changed code and existing code it can't see.
Where it fits in a pipeline
This is a pre-review filter, not a SAST replacement. Think of it as a layer that catches the obvious stuff before a human reviewer spends their time on it.
Some ways you could use it:
- Pre-commit hook: run `git diff --cached` through `/review` before allowing a commit
- CI/CD gate: call `/review` in a GitHub Action and post results as a PR comment
- Changelog automation: call `/changelog` on merge to auto-update CHANGELOG.md
- Commit message generation: pipe your staged diff through `/commit-message` to stop writing "fix stuff"
Here's what a basic integration looks like:
```bash
DIFF=$(git diff HEAD~1)
# Interpolating $DIFF directly into a JSON string breaks on quotes and
# newlines, so let jq do the escaping and pipe the payload to curl.
jq -n --arg diff "$DIFF" '{diff: $diff}' |
  curl -X POST https://diffsense.p.rapidapi.com/review \
    -H "Content-Type: application/json" \
    -H "X-RapidAPI-Key: YOUR_KEY" \
    -d @-
```
Or in Python:
```python
import requests

with open("my.patch") as f:
    diff = f.read()

response = requests.post(
    "https://diffsense.p.rapidapi.com/review",
    headers={
        "Content-Type": "application/json",
        "X-RapidAPI-Key": "YOUR_KEY",
    },
    json={"diff": diff},
)
print(response.json())
```
The bigger insight
The most valuable thing I've learned building this: LLMs are better at analyzing changes than analyzing code. A diff is a narrative — "someone tried to do X" — and LLMs are good at narratives. Static code is just structure, and traditional tools handle structure better.
That's why I think diff-level analysis is a genuine niche, not just "yet another AI code review tool." It's not competing with SAST — it's covering a different surface.
Try it / read the code
The implementation is open source if you want to look at the FastAPI structure or the prompt engineering: github.com/diffsense/diffsense-api
There's also a hosted version on RapidAPI with a free tier (50 requests/month): rapidapi.com/terrycrews99/api/diffsense
I'm actively developing this and would genuinely appreciate feedback — what patterns would you want caught at the diff level? What would make this useful enough to add to your pipeline?