Want to Build Your Own Local CVE Database? This Is More or Less How We Do It
This is more or less how the Verimu vulnerability ingestion system handles CVE/advisory synchronization today.
Important context up front: this is the practical architecture we implement, with a few details still being refined as we harden it.
The requirement is simple:
- keep a local vulnerability dataset in Postgres,
- update it frequently,
- expose it programmatically to internal systems,
- keep infra lightweight.
No UI required.
The two artifacts we sync
For this setup, we model ingestion as:
- `all` → snapshot (full baseline)
- `update` → delta (incremental changes)
In OSV terms, that is:
- `all.zip` (full dump),
- `modified_id.csv` (reverse-chronological change stream).
The OSV docs describe both endpoints and the `modified_id.csv` format in detail.
Strictly speaking, only one of these is a ZIP file. We still treat them as the same pattern operationally: full baseline + incremental updates.
Our cadence
- Weekly: refresh from the full `all.zip` baseline
- Hourly: process `modified_id.csv` updates and upsert changed records
Why this split:
- weekly full sync is a safety net against drift, missed jobs, or parser bugs,
- hourly deltas keep us current without reprocessing everything.
High-level data model
We keep this intentionally lean:
- `vulnerabilities(id, modified_at, published_at, withdrawn_at, summary)`
- `aliases(vuln_id, alias)`
- `affected_packages(vuln_id, ecosystem, package_name, purl, introduced, fixed)`
- `sync_state(key, last_modified)`
That is enough for most API queries:
- "show latest CVEs affecting npm packages"
- "all vulns for package X"
- "map advisory IDs to CVE aliases"
Go snippet: poll the update stream
This is the key loop: read `modified_id.csv` from the top, stop when you hit your last watermark, fetch the changed records, and upsert.
```go
package main

import (
	"encoding/csv"
	"fmt"
	"net/http"
	"time"
)

const modifiedCSV = "https://storage.googleapis.com/osv-vulnerabilities/modified_id.csv"

func syncOnce(lastSeen time.Time) (time.Time, error) {
	resp, err := http.Get(modifiedCSV)
	if err != nil {
		return lastSeen, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return lastSeen, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	r := csv.NewReader(resp.Body)
	newest := lastSeen
	for {
		row, err := r.Read()
		if err != nil {
			break // io.EOF or a malformed row: stop reading
		}
		// row[0] = ISO timestamp, row[1] = "<ecosystem>/<id>"
		ts, err := time.Parse(time.RFC3339, row[0])
		if err != nil {
			continue // skip rows with unparseable timestamps
		}
		if !lastSeen.IsZero() && !ts.After(lastSeen) {
			// CSV is reverse chronological, so we can stop early.
			break
		}
		if ts.After(newest) {
			newest = ts
		}
		// Example row[1]: "Go/GO-2026-4337"
		// Fetch JSON: https://storage.googleapis.com/osv-vulnerabilities/Go/GO-2026-4337.json
		_ = fetchAndUpsert(row[1] + ".json")
	}
	return newest, nil
}
```
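One detail worth handling explicitly: entries such as `Red Hat/RHSA-2026:3970` contain characters (here a space) that must be escaped when building the per-record JSON URL. A sketch of that mapping, with `recordURL` as our own helper name:

```go
package main

import (
	"net/url"
	"strings"
)

const osvBase = "https://storage.googleapis.com/osv-vulnerabilities/"

// recordURL maps a modified_id.csv entry like "Go/GO-2026-4337"
// to its per-record JSON URL, escaping each path segment so that
// ecosystems with spaces (e.g. "Red Hat") form valid URLs.
func recordURL(entry string) string {
	parts := strings.Split(entry, "/")
	for i, p := range parts {
		parts[i] = url.PathEscape(p)
	}
	return osvBase + strings.Join(parts, "/") + ".json"
}
```

Escaping per segment (rather than the whole string) keeps the `/` separator between ecosystem and ID intact.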
Go snippet: baseline bootstrap from all.zip
On the weekly refresh, we stream entries from `all.zip` and upsert each JSON object.
```go
package main

import (
	"archive/zip"
	"bytes"
	"io"
	"net/http"
)

const allZipURL = "https://storage.googleapis.com/osv-vulnerabilities/all.zip"

func fullRefresh() error {
	resp, err := http.Get(allZipURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Note: this buffers the whole archive in memory, which is
	// fine for a sketch but not for production (see below).
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	zr, err := zip.NewReader(bytes.NewReader(body), int64(len(body)))
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			return err
		}
		if err := upsertFromJSON(rc); err != nil {
			rc.Close()
			return err
		}
		rc.Close()
	}
	return nil
}
```
In our implementation, we also layer in:
- streaming/decompression that avoids loading full ZIP into memory,
- bounded concurrency worker pools,
- retry + dead-letter handling for parse failures,
- structured metrics (`processed`, `updated`, `withdrawn`, `failed`).
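The metrics side can be as simple as a set of atomic counters shared by the worker pool. A sketch whose field names mirror the list above; the type and method are our illustration:

```go
package main

import "sync/atomic"

// syncMetrics tracks per-run ingestion counters; safe for
// concurrent use by worker goroutines.
type syncMetrics struct {
	processed atomic.Int64
	updated   atomic.Int64
	withdrawn atomic.Int64
	failed    atomic.Int64
}

// record tallies one processed record and its outcome flags.
func (m *syncMetrics) record(updated, withdrawn, failed bool) {
	m.processed.Add(1)
	if updated {
		m.updated.Add(1)
	}
	if withdrawn {
		m.withdrawn.Add(1)
	}
	if failed {
		m.failed.Add(1)
	}
}
```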
Handling "is this a CVE?" correctly
One common surprise: many records are not keyed by a `CVE-*` identifier at the top level.
Examples from real update streams:
- `Go/GO-2026-4337`
- `Red Hat/RHSA-2026:3970`
- `CleanStart/CLEANSTART-2026-VC01496`
That is expected. OSV IDs are source-native. CVE IDs are usually in aliases when available.
Design implication:
- use `id` as your primary key,
- index aliases separately,
- never assume one record equals one CVE string.
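A sketch of pulling CVE aliases out of a raw OSV record. The struct fields follow the OSV schema's top-level `id` and `aliases`; the helper name is ours:

```go
package main

import (
	"encoding/json"
	"strings"
)

// osvRecord captures just the identity fields of an OSV JSON record.
type osvRecord struct {
	ID      string   `json:"id"`
	Aliases []string `json:"aliases"`
}

// cveAliases returns the CVE-* identifiers attached to a record,
// regardless of its source-native top-level ID.
func cveAliases(raw []byte) ([]string, error) {
	var rec osvRecord
	if err := json.Unmarshal(raw, &rec); err != nil {
		return nil, err
	}
	var cves []string
	for _, a := range rec.Aliases {
		if strings.HasPrefix(a, "CVE-") {
			cves = append(cves, a)
		}
	}
	return cves, nil
}
```

Feeding each upserted record through this is what populates the `aliases` table, and it naturally handles records with zero, one, or many CVE aliases.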
Size planning (lean schema)
For a lean normalized schema, a practical planning range is:
- ~1 to 3 GB for core tables + indexes at current scale,
- more if you retain full raw JSON blobs and historical snapshots.
On any small Linux box (8 GB RAM, 80 GB+ disk), this is usually comfortable if you keep ingestion workers constrained and avoid unnecessary duplication.
What we're still improving
As the pipeline matures:
- optional NVD enrichment for CVSS/CWE overlays,
- optional source-priority logic when records conflict,
- package/ecosystem-specific materialized views for faster API queries.
The goal is reliability first, enrichment second.
References
- OSV data dumps (`all.zip`, `modified_id.csv`, per-ecosystem paths)
- OSV schema docs
- CVE List v5 repository
If you are building this for CRA timelines, the real win is not "having a big database." The win is having a sync process you can trust when you need to answer impact questions quickly, in a programmatic fashion that doesn't require poring over NVD web pages or waiting for a CVE to be published.
Get started free → or book a demo if you want a CRA-focused workflow without operating this ingestion stack yourself.