
Want to Build Your Own Local CVE Database? This Is More or Less How We Do It

· Verimu
CVE · OSV · PostgreSQL · Go · Vulnerability Management

This is more or less how the Verimu vulnerability ingestion system handles CVE/advisory synchronization today.

Important context up front: this is the practical architecture we implement, with a few details still being refined as we harden it.

The requirement is simple:

  • keep a local vulnerability dataset in Postgres,
  • update it frequently,
  • expose it programmatically to internal systems,
  • keep infra lightweight.

No UI required.

The two artifacts we sync

For this setup, we model ingestion as:

  1. full snapshot (the complete baseline)
  2. update delta (incremental changes)

In OSV terms, that is:

  • all.zip (full dump),
  • modified_id.csv (reverse-chronological change stream).

The OSV docs describe both endpoints and the modified_id.csv format in detail.

Strictly speaking, only one of these is a ZIP file. We still treat them as the same pattern operationally: full baseline + incremental updates.

Our cadence

  • Weekly: refresh from full all.zip baseline
  • Hourly: process modified_id.csv updates and upsert changed records

Why this split:

  • weekly full sync is a safety net against drift, missed jobs, or parser bugs,
  • hourly deltas keep us current without reprocessing everything.

High-level data model

We keep this intentionally lean:

  • vulnerabilities (id, modified_at, published_at, withdrawn_at, summary)
  • aliases (vuln_id, alias)
  • affected_packages (vuln_id, ecosystem, package_name, purl, introduced, fixed)
  • sync_state (key, last_modified)

That is enough for most API queries:

  • "show latest CVEs affecting npm packages"
  • "all vulns for package X"
  • "map advisory IDs to CVE aliases"

Go snippet: poll the update stream

This is the key loop: read modified_id.csv from the top, stop when you hit your last watermark, fetch changed records, upsert.

package main

import (
  "encoding/csv"
  "errors"
  "io"
  "net/http"
  "time"
)

const modifiedCSV = "https://storage.googleapis.com/osv-vulnerabilities/modified_id.csv"

func syncOnce(lastSeen time.Time) (time.Time, error) {
  resp, err := http.Get(modifiedCSV)
  if err != nil {
    return lastSeen, err
  }
  defer resp.Body.Close()

  r := csv.NewReader(resp.Body)
  newest := lastSeen

  for {
    row, err := r.Read()
    if errors.Is(err, io.EOF) {
      break
    }
    if err != nil {
      return newest, err
    }
    if len(row) < 2 {
      continue // skip malformed rows
    }

    // row[0] = ISO timestamp, row[1] = "<ecosystem>/<id>"
    ts, err := time.Parse(time.RFC3339, row[0])
    if err != nil {
      continue // unparseable timestamp; skip the row rather than abort
    }
    if !lastSeen.IsZero() && !ts.After(lastSeen) {
      // CSV is reverse chronological, so we can stop early.
      break
    }

    if ts.After(newest) {
      newest = ts
    }

    // Example row[1]: "Go/GO-2026-4337"
    // Fetch JSON: https://storage.googleapis.com/osv-vulnerabilities/Go/GO-2026-4337.json
    // fetchAndUpsert (not shown) downloads the record and upserts it into Postgres.
    _ = fetchAndUpsert(row[1] + ".json")
  }

  return newest, nil
}

Go snippet: baseline bootstrap from all.zip

On weekly refresh, we walk the entries of all.zip and upsert each JSON object. The simple version below buffers the whole archive in memory; the streaming variant is one of the production refinements noted afterwards.

package main

import (
  "archive/zip"
  "bytes"
  "fmt"
  "io"
  "net/http"
)

const allZipURL = "https://storage.googleapis.com/osv-vulnerabilities/all.zip"

func fullRefresh() error {
  resp, err := http.Get(allZipURL)
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("all.zip download: unexpected status %s", resp.Status)
  }

  // zip.NewReader needs an io.ReaderAt, so this simple version buffers the
  // whole archive; the production path streams to disk instead (see below).
  body, err := io.ReadAll(resp.Body)
  if err != nil {
    return err
  }

  zr, err := zip.NewReader(bytes.NewReader(body), int64(len(body)))
  if err != nil {
    return err
  }

  for _, f := range zr.File {
    rc, err := f.Open()
    if err != nil {
      return err
    }
    // upsertFromJSON (not shown) decodes one OSV record and upserts it.
    if err := upsertFromJSON(rc); err != nil {
      rc.Close()
      return err
    }
    rc.Close()
  }
  return nil
}

In our implementation, we also layer in:

  • streaming/decompression that avoids loading full ZIP into memory,
  • bounded concurrency worker pools,
  • retry + dead-letter handling for parse failures,
  • structured metrics (processed, updated, withdrawn, failed).

Handling "is this a CVE?" correctly

One common surprise: many records are not keyed by CVE-* at the top level.

Examples from real update streams:

  • Go/GO-2026-4337
  • Red Hat/RHSA-2026:3970
  • CleanStart/CLEANSTART-2026-VC01496

That is expected. OSV IDs are source-native. CVE IDs are usually in aliases when available.

Design implication:

  • use id as your primary key,
  • index aliases separately,
  • never assume one record equals one CVE string.

Size planning (lean schema)

For a lean normalized schema, a practical planning range is:

  • ~1 to 3 GB for core tables + indexes at current scale,
  • more if you retain full raw JSON blobs and historical snapshots.

On any small Linux box (8 GB RAM, 80 GB+ disk), this is usually comfortable if you keep ingestion workers constrained and avoid unnecessary duplication.

What we're still improving

As the pipeline matures:

  1. optional NVD enrichment for CVSS/CWE overlays,
  2. optional source-priority logic when records conflict,
  3. package/ecosystem-specific materialized views for faster API queries.

The goal is reliability first, enrichment second.

If you are building this for CRA timelines, the real win is not "having a big database." The win is having a sync process you can trust when you need to answer impact questions quickly - and programmatically, without poring over NVD web pages or waiting for a CVE to be published.

Get started free → or book a demo if you want a CRA-focused workflow without operating this ingestion stack yourself.