
Want to Build Your Own Local CVE Database? This Is More or Less How We Do It

· Verimu
CVE · OSV · PostgreSQL · Go · Vulnerability Management

This is more or less how the Verimu vulnerability ingestion system handles CVE/advisory synchronization today.

Important context up front: this is the practical architecture we implement, with a few details still being refined as we harden it.

The requirement is simple:

  • keep a local vulnerability dataset in Postgres,
  • update it frequently,
  • expose it programmatically to internal systems,
  • keep infra lightweight.

No UI required.

The two artifacts we sync

For this setup, we model ingestion as:

  1. full snapshot (the complete baseline)
  2. update delta (incremental changes)

In OSV terms, that is:

  • all.zip (full dump),
  • modified_id.csv (reverse-chronological change stream).

The OSV docs describe both endpoints and the modified_id.csv format in detail.

Strictly speaking, only one of these is a ZIP file. We still treat them as the same pattern operationally: full baseline + incremental updates.

Our cadence

  • Weekly: refresh from full all.zip baseline
  • Hourly: process modified_id.csv updates and upsert changed records

Why this split:

  • weekly full sync is a safety net against drift, missed jobs, or parser bugs,
  • hourly deltas keep us current without reprocessing everything.

High-level data model

We keep this intentionally lean:

  • vulnerabilities (id, modified_at, published_at, withdrawn_at, summary)
  • aliases (vuln_id, alias)
  • affected_packages (vuln_id, ecosystem, package_name, purl, introduced, fixed)
  • sync_state (key, last_modified)

That is enough for most API queries:

  • "show latest CVEs affecting npm packages"
  • "all vulns for package X"
  • "map advisory IDs to CVE aliases"

Go snippet: poll the update stream

This is the key loop: read modified_id.csv from the top, stop when you hit your last watermark, fetch changed records, upsert.

package main

import (
  "encoding/csv"
  "errors"
  "io"
  "net/http"
  "time"
)

const modifiedCSV = "https://storage.googleapis.com/osv-vulnerabilities/modified_id.csv"

func syncOnce(lastSeen time.Time) (time.Time, error) {
  resp, err := http.Get(modifiedCSV)
  if err != nil {
    return lastSeen, err
  }
  defer resp.Body.Close()

  r := csv.NewReader(resp.Body)
  newest := lastSeen

  for {
    row, err := r.Read()
    if errors.Is(err, io.EOF) {
      break
    }
    if err != nil {
      return newest, err
    }
    if len(row) < 2 {
      continue // skip malformed rows
    }

    // row[0] = ISO timestamp, row[1] = "<ecosystem>/<id>"
    ts, err := time.Parse(time.RFC3339, row[0])
    if err != nil {
      continue // unparseable timestamp; skip the row rather than abort
    }
    if !lastSeen.IsZero() && !ts.After(lastSeen) {
      // CSV is reverse chronological, so we can stop early.
      break
    }

    if ts.After(newest) {
      newest = ts
    }

    // Example row[1]: "Go/GO-2026-4337"
    // Fetch JSON: https://storage.googleapis.com/osv-vulnerabilities/Go/GO-2026-4337.json
    // fetchAndUpsert (not shown) downloads the record and upserts it into Postgres.
    _ = fetchAndUpsert(row[1] + ".json")
  }

  return newest, nil
}

Go snippet: baseline bootstrap from all.zip

On weekly refresh, we walk the entries of all.zip and upsert each JSON object. The simple version below buffers the whole archive in memory; the streaming variant is one of the production refinements noted afterwards.

package main

import (
  "archive/zip"
  "bytes"
  "fmt"
  "io"
  "net/http"
)

const allZipURL = "https://storage.googleapis.com/osv-vulnerabilities/all.zip"

func fullRefresh() error {
  resp, err := http.Get(allZipURL)
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("all.zip download: unexpected status %s", resp.Status)
  }

  // zip.NewReader needs an io.ReaderAt, so this simple version buffers the
  // whole archive; the production path streams to disk instead (see below).
  body, err := io.ReadAll(resp.Body)
  if err != nil {
    return err
  }

  zr, err := zip.NewReader(bytes.NewReader(body), int64(len(body)))
  if err != nil {
    return err
  }

  for _, f := range zr.File {
    rc, err := f.Open()
    if err != nil {
      return err
    }
    // upsertFromJSON (not shown) decodes one OSV record and upserts it.
    if err := upsertFromJSON(rc); err != nil {
      rc.Close()
      return err
    }
    rc.Close()
  }
  return nil
}

In our implementation, we also layer in:

  • streaming/decompression that avoids loading full ZIP into memory,
  • bounded concurrency worker pools,
  • retry + dead-letter handling for parse failures,
  • structured metrics (processed, updated, withdrawn, failed).

Handling "is this a CVE?" correctly

One common surprise: many records are not keyed by CVE-* at the top level.

Examples from real update streams:

  • Go/GO-2026-4337
  • Red Hat/RHSA-2026:3970
  • CleanStart/CLEANSTART-2026-VC01496

That is expected. OSV IDs are source-native. CVE IDs are usually in aliases when available.

Design implication:

  • use id as your primary key,
  • index aliases separately,
  • never assume one record equals one CVE string.

Size planning (lean schema)

For a lean normalized schema, a practical planning range is:

  • ~1 to 3 GB for core tables + indexes at current scale,
  • more if you retain full raw JSON blobs and historical snapshots.

On any small Linux box (8 GB RAM, 80 GB+ disk), this is usually comfortable if you keep ingestion workers constrained and avoid unnecessary duplication.

What we're still improving

As the pipeline matures:

  1. optional NVD enrichment for CVSS/CWE overlays,
  2. optional source-priority logic when records conflict,
  3. package/ecosystem-specific materialized views for faster API queries.

The goal is reliability first, enrichment second.

If you are building this for CRA timelines, the real win is not "having a big database." The win is having a sync process you can trust when you need to answer impact questions quickly - and programmatically, without poring over NVD web pages or waiting for a CVE to be published.

Get started free → or book a demo if you want a CRA-focused workflow without operating this ingestion stack yourself.