DBA Vector Analysis

L3
Model Context Protocol · Postgres · Vectors

Analyze pgvector database storage, identify vector columns, assess space utilization and performance for RAG applications.

Created by Fanshi Zhang
2025-08-18
Performance Optimization · Audit And Compliance · Statistical Aggregation

Model Ranking

| Provider | Model | Pass@4 | Pass^4 | Avg Time | Avg Turns | Input Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|---|---|---|---|
| Claude | claude-sonnet-4 | 4/4 |  | 347.3s | 33.8 | 540,176 | 10,195 | 550,372 |
| Claude | claude-sonnet-4-high | 4/4 |  | 191.3s | 28.5 | 459,313 | 8,337 | 467,650 |
| Claude | claude-sonnet-4-low | 4/4 |  | 208.5s | 32.3 | 539,287 | 8,316 | 547,603 |
| OpenAI | gpt-5-high | 4/4 |  | 3027.4s | 39.8 | 516,150 | 109,050 | 625,200 |
| OpenAI | gpt-5-low | 4/4 |  | 753.8s | 26.0 | 607,676 | 46,510 | 654,186 |
| OpenAI | gpt-5-medium | 4/4 |  | 1225.2s | 35.5 | 459,682 | 59,914 | 519,596 |
| OpenAI | gpt-5-mini-high | 4/4 |  | 569.8s | 27.8 | 738,823 | 70,074 | 808,896 |
| MoonshotAI | kimi-k2-0711 | 4/4 |  | 216.2s | 31.3 | 227,560 | 4,688 | 232,248 |
| Qwen | qwen-3-coder-plus | 4/4 |  | 89.3s | 32.3 | 368,762 | 3,132 | 371,894 |
| Qwen | qwen-3-max | 4/4 |  | 91.3s | 29.0 | 283,119 | 2,325 | 285,445 |
| DeepSeek | deepseek-chat | 3/4 |  | 290.3s | 28.0 | 371,896 | 3,632 | 375,528 |
| OpenAI | gpt-4-1-mini | 3/4 |  | 99.1s | 21.5 | 217,754 | 4,218 | 221,971 |
| OpenAI | gpt-5-mini-medium | 3/4 |  | 355.6s | 24.8 | 356,677 | 21,827 | 378,504 |
| Grok | grok-4 | 3/4 |  | 207.9s | 33.8 | 310,551 | 7,345 | 317,896 |
| Grok | grok-code-fast-1 | 3/4 |  | 101.1s | 29.8 | 428,660 | 11,754 | 440,415 |
| MoonshotAI | kimi-k2-0905 | 3/4 |  | 289.9s | 25.0 | 223,687 | 4,074 | 227,760 |
| OpenAI | o3 | 3/4 |  | 104.8s | 25.0 | 117,575 | 5,932 | 123,507 |
| OpenAI | o4-mini | 3/4 |  | 166.6s | 9.0 | 45,541 | 11,066 | 56,607 |
| Gemini | gemini-2-5-flash | 2/4 |  | 155.8s | 26.3 | 278,232 | 24,911 | 303,143 |
| Z.ai | glm-4-5 | 2/4 |  | 174.0s | 25.0 | 332,272 | 5,824 | 338,095 |
| Gemini | gemini-2-5-pro | 1/4 |  | 32.1s | 2.5 | 4,981 | 2,439 | 7,420 |
| OpenAI | gpt-5-nano-low | 1/4 |  | 43.1s | 8.8 | 38,525 | 5,948 | 44,472 |
| OpenAI | gpt-5-nano-medium | 1/4 |  | 236.6s | 28.8 | 562,267 | 42,093 | 604,360 |
| Claude | claude-opus-4-1 | 0/1 | -- | 352.9s | 27.0 | 441,117 | 8,585 | 449,702 |
| OpenAI | gpt-4-1 | 0/4 |  | 73.5s | 13.3 | 477,884 | 2,657 | 480,541 |
| OpenAI | gpt-4-1-nano | 0/4 |  | 10.3s | 5.3 | 23,137 | 479 | 23,616 |
| OpenAI | gpt-5-mini-low | 0/4 |  | 77.1s | 7.8 | 40,154 | 7,366 | 47,520 |
| OpenAI | gpt-5-nano-high | 0/4 |  | 603.6s | 53.5 | 912,455 | 122,531 | 1,034,985 |
| OpenAI | gpt-oss-120b | 0/4 |  | 31.6s | 6.5 | 24,877 | 1,885 | 26,761 |

Task State

Table "documents" { "id" int4 [pk, not null, increment] "title" text [not null] "content" text [not null] "source_url" text "document_type" varchar(50) [default: 'article'] "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "updated_at" timestamp [default: `CURRENT_TIMESTAMP`] "word_count" int4 "embedding" public.vector Indexes { created_at [type: btree, name: "documents_created_idx"] embedding [type: hnsw, name: "documents_embedding_idx"] title [type: btree, name: "documents_title_idx"] document_type [type: btree, name: "documents_type_idx"] } } Table "document_chunks" { "id" int4 [pk, not null, increment] "document_id" int4 "chunk_index" int4 [not null] "chunk_text" text [not null] "chunk_size" int4 "overlap_size" int4 [default: 0] "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "embedding" public.vector Indexes { document_id [type: btree, name: "chunks_doc_id_idx"] embedding [type: hnsw, name: "chunks_embedding_idx"] chunk_index [type: btree, name: "chunks_index_idx"] } } Table "user_queries" { "id" int4 [pk, not null, increment] "query_text" text [not null] "user_id" varchar(100) "session_id" varchar(100) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "response_time_ms" int4 "embedding" public.vector Indexes { created_at [type: btree, name: "queries_created_idx"] embedding [type: hnsw, name: "queries_embedding_idx"] user_id [type: btree, name: "queries_user_idx"] } } Table "embedding_models" { "id" int4 [pk, not null, increment] "model_name" varchar(100) [unique, not null] "provider" varchar(50) [not null] "dimensions" int4 [not null] "max_tokens" int4 "cost_per_token" numeric(10,8) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "is_active" bool [default: true] } Table "knowledge_base" { "id" int4 [pk, not null, increment] "kb_name" varchar(100) [not null] "description" text "domain" varchar(50) "language" varchar(10) [default: 'en'] "total_documents" int4 [default: 0] "total_chunks" int4 [default: 0] "total_storage_mb" numeric(10,2) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "updated_at" timestamp [default: `CURRENT_TIMESTAMP`] } Table "search_cache" { "id" int4 [pk, not null, increment] "query_hash" varchar(64) [not null] "query_text" text [not null] "results_json" jsonb "result_count" int4 "search_time_ms" int4 "similarity_threshold" numeric(4,3) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "expires_at" timestamp Indexes { expires_at [type: btree, name: "cache_expires_idx"] query_hash [type: btree, name: "cache_hash_idx"] } } Ref "document_chunks_document_id_fkey":"documents"."id" < "document_chunks"."document_id" [delete: cascade]

Instruction

PostgreSQL Vector Database Analysis

Analyze and optimize a pgvector-powered database to understand storage patterns, performance characteristics, and data quality for embeddings in production workloads.

What's this about?

You've got a PostgreSQL database running the vector extension, storing embeddings for RAG (document similarity search), image recognition, or other ML workloads. Your job is to dive deep into this vector database and figure out what's going on under the hood. You need to understand:

  • how vectors are stored
  • how much space they're taking up
  • whether indexes are working properly
  • if there are any data quality issues lurking around

What you need to investigate

First, get familiar with what you're working with:

  • Check vector extension status: ensure it's installed properly, check the version, and identify any configuration issues
  • Identify all vector columns across the entire database: report each column, its data type, and its vector dimensions (a starting query is sketched after this list)
  • Map the vector landscape: understand the relationships between vector tables and regular tables, their foreign keys, and their dependencies
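
As a rough starting point, both checks can be driven from the system catalogs. This is a minimal sketch, not the full deliverable; the declared dimension still has to be pulled from pg_attribute, since pgvector stores it as the column's typmod:

SQL
-- Confirm the extension is installed and check its version.
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';

-- Find every vector column in the database.
SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE data_type = 'USER-DEFINED'
  AND udt_name = 'vector'
ORDER BY table_schema, table_name, column_name;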

Vectors can eat up a lot of storage, so let's see where the bytes are going:

  • Calculate vector storage overhead: measure how much space vectors take compared to regular columns in the same tables (see the sketch after this list)
  • Analyze table sizes: identify which vector tables are the biggest storage consumers, broken down by table
  • Understand growth patterns: examine record counts and project future storage needs based on current data
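
A hedged sketch of the per-table measurement, using pg_total_relation_size for the full on-disk footprint and pg_column_size summed over the embedding column for the vector share (documents and embedding come from the schema above; repeat per vector table):

SQL
SELECT pg_total_relation_size('documents')          AS total_size_bytes,
       COALESCE(SUM(pg_column_size(embedding)), 0)  AS vector_data_bytes,
       COUNT(*)                                     AS row_count
FROM documents;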

Vectors without proper indexes are painfully slow, so investigate:

  • Catalog vector indexes: find all HNSW and IVFFlat indexes and document their configurations and parameters (a catalog query is sketched after this list)
  • Measure index effectiveness: determine whether indexes are actually being used and helping query performance
  • Identify optimization opportunities: spot missing indexes, suboptimal configurations, and unused indexes
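
One way to pull that catalog is to join pg_index to the access-method catalog. The sketch below lists HNSW and IVFFlat indexes with their size and definition (build parameters such as m/ef_construction or lists appear in the definition string); pg_stat_user_indexes.idx_scan can be checked separately to see whether an index is actually used:

SQL
SELECT n.nspname                      AS schema,
       t.relname                      AS table_name,
       i.relname                      AS index_name,
       am.amname                      AS index_type,        -- 'hnsw' or 'ivfflat'
       pg_relation_size(i.oid)        AS index_size_bytes,
       pg_get_indexdef(ix.indexrelid) AS index_definition   -- includes build parameters
FROM pg_index ix
JOIN pg_class i     ON i.oid = ix.indexrelid
JOIN pg_class t     ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_am am       ON am.oid = i.relam
WHERE am.amname IN ('hnsw', 'ivfflat');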

Bad vector data makes everything worse:

  • Hunt for data issues: locate NULL vectors, dimension mismatches, and corrupted embeddings that could break queries (see the checks sketched after this list)
  • Validate consistency: ensure vectors in each column have consistent dimensions across all rows
  • Check for outliers: find vectors that might be skewing similarity calculations or causing performance issues
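
A hedged sketch of the quality checks for one table, relying on pgvector's vector_dims() and vector_norm() functions (documents is only the example; repeat per vector column):

SQL
SELECT COUNT(*) FILTER (WHERE embedding IS NULL)          AS null_vectors,
       COUNT(DISTINCT vector_dims(embedding))             AS distinct_dimensions,  -- should be exactly 1
       MIN(vector_dims(embedding))                        AS min_dimensions,
       MAX(vector_dims(embedding))                        AS max_dimensions,
       COUNT(*) FILTER (WHERE vector_norm(embedding) = 0) AS zero_norm_vectors     -- candidates for outlier review
FROM documents;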

Your deliverables

Create these analysis tables and populate them with your findings:

vector_analysis_columns

Complete catalog of every vector column you find:

SQL
CREATE TABLE vector_analysis_columns (
    schema VARCHAR(50),
    table_name VARCHAR(100),
    column_name VARCHAR(100),
    dimensions INTEGER,
    data_type VARCHAR(50),
    has_constraints BOOLEAN,
    rows BIGINT
);
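
One plausible way to populate it is an INSERT ... SELECT over the catalogs. Two assumptions in this sketch: has_constraints is approximated by the column's NOT NULL flag, and rows uses the planner's reltuples estimate (use COUNT(*) per table for exact counts):

SQL
INSERT INTO vector_analysis_columns
    (schema, table_name, column_name, dimensions, data_type, has_constraints, rows)
SELECT n.nspname,
       c.relname,
       a.attname,
       NULLIF(a.atttypmod, -1),               -- declared dimension; NULL if the column was created without one
       format_type(a.atttypid, a.atttypmod),  -- e.g. 'vector(1536)'
       a.attnotnull,                          -- crude stand-in for "has constraints"
       c.reltuples::bigint                    -- planner estimate
FROM pg_attribute a
JOIN pg_class c     ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_type t      ON t.oid = a.atttypid
WHERE t.typname = 'vector'
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND c.relkind = 'r';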

vector_analysis_storage_consumption

Show exactly where storage is being consumed:

SQL
CREATE TABLE vector_analysis_storage_consumption (
    schema VARCHAR(50),
    table_name VARCHAR(100),
    total_size_bytes BIGINT,
    vector_data_bytes BIGINT,
    regular_data_bytes BIGINT,
    vector_storage_pct NUMERIC(5,2),
    row_count BIGINT
);
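
A hedged population pattern for a single table (documents is only the example; repeat or UNION ALL over every vector table), deriving regular_data_bytes by subtraction and the percentage from the two byte counts:

SQL
INSERT INTO vector_analysis_storage_consumption
    (schema, table_name, total_size_bytes, vector_data_bytes, regular_data_bytes, vector_storage_pct, row_count)
SELECT 'public',
       'documents',
       pg_total_relation_size('documents'),
       COALESCE(SUM(pg_column_size(embedding)), 0),
       pg_total_relation_size('documents') - COALESCE(SUM(pg_column_size(embedding)), 0),
       ROUND(100.0 * COALESCE(SUM(pg_column_size(embedding)), 0)
             / NULLIF(pg_total_relation_size('documents'), 0), 2),
       COUNT(*)
FROM documents;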

vector_analysis_indices

Document all vector indexes and their characteristics:

SQL
CREATE TABLE vector_analysis_indices (
    schema VARCHAR(50),
    table_name VARCHAR(100),
    column_name VARCHAR(100),
    index_name VARCHAR(100),
    index_type VARCHAR(50), -- 'hnsw', 'ivfflat', etc.
    index_size_bytes BIGINT
);
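
And a matching population sketch that reuses the index query from earlier; recovering column_name from the index's first key attribute assumes single-column vector indexes, which matches the schema above:

SQL
INSERT INTO vector_analysis_indices
    (schema, table_name, column_name, index_name, index_type, index_size_bytes)
SELECT n.nspname,
       t.relname,
       a.attname,
       i.relname,
       am.amname,
       pg_relation_size(i.oid)
FROM pg_index ix
JOIN pg_class i     ON i.oid = ix.indexrelid
JOIN pg_class t     ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_am am       ON am.oid = i.relam
JOIN pg_attribute a ON a.attrelid = ix.indrelid AND a.attnum = ix.indkey[0]
WHERE am.amname IN ('hnsw', 'ivfflat');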

Use PostgreSQL system catalogs, pgvector-specific functions (such as vector_dims), and storage analysis functions to gather comprehensive metrics about the vector database implementation.



Verify

*.py
Python
"""
Verification script for Vector Database DBA Analysis task.

This script verifies that the candidate has properly analyzed the vector database
and stored their findings in appropriate result tables.
"""

import logging
import psycopg2
import os
import sys
from typing import Dict, Any

logger = logging.getLogger(__name__)


def get_connection_params():
    """Get database connection parameters from environment variables."""
    return {
        "host": os.getenv("POSTGRES_HOST", "localhost"),
        "port": int(os.getenv("POSTGRES_PORT", 5432)),
        "database": os.getenv("POSTGRES_DATABASE"),
        "user": os.getenv("POSTGRES_USERNAME"),
        "password": os.getenv("POSTGRES_PASSWORD"),
    }


def verify_vector_analysis_columns(conn) -> Dict[str, Any]:
    """Verify the vector_analysis_columns table exists, has correct columns, and contains actual vector columns from the database."""
    results = {'passed': False, 'issues': []}
    expected_columns = [
        'schema', 'table_name', 'column_name', 'dimensions', 'data_type', 'has_constraints', 'rows'
    ]
    try:
        with conn.cursor() as cur:
            # Check if table exists
            cur.execute("""
                SELECT EXISTS (
                    SELECT FROM information_schema.tables
                    WHERE table_name = 'vector_analysis_columns'
                );
            """)
            if not cur.fetchone()[0]:
                results['issues'].append("vector_analysis_columns table not found")
                return results

            # Check columns
            cur.execute("""
                SELECT column_name FROM information_schema.columns
                WHERE table_name = 'vector_analysis_columns'
                ORDER BY column_name;
            """)
            actual_columns = {row[0] for row in cur.fetchall()}
            missing = set(expected_columns) - actual_columns
            extra = actual_columns - set(expected_columns)
            if missing:
                results['issues'].append(f"Missing columns: {missing}")
            if extra:
                results['issues'].append(f"Unexpected columns: {extra}")

            # Check for data
            cur.execute("SELECT COUNT(*) FROM vector_analysis_columns;")
            count = cur.fetchone()[0]
            if count == 0:
                results['issues'].append("No rows found in vector_analysis_columns")
                return results

            # Get actual vector columns from the database
            cur.execute("""
                SELECT table_name, column_name
                FROM information_schema.columns
                WHERE data_type = 'USER-DEFINED'
                AND udt_name = 'vector'
                ORDER BY table_name, column_name;
            """)
            actual_vector_columns = set(cur.fetchall())

            # Get what the agent found
            cur.execute("""
                SELECT table_name, column_name
                FROM vector_analysis_columns
                ORDER BY table_name, column_name;
            """)
            found_vector_columns = set(cur.fetchall())

            # Check if agent found the actual vector columns
            missing_vectors = actual_vector_columns - found_vector_columns
            extra_vectors = found_vector_columns - actual_vector_columns

            if missing_vectors:
                results['issues'].append(f"Missing: {missing_vectors}")
            if extra_vectors:
                results['issues'].append(f"Non-existing: {extra_vectors}")

            if not missing and not extra and count > 0 and not missing_vectors and not extra_vectors:
                results['passed'] = True

    except psycopg2.Error as e:
        results['issues'].append(f"Database error: {e}")
    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
    return results


def verify_vector_analysis_storage_consumption(conn) -> Dict[str, Any]:
    """Verify the vector_analysis_storage_consumption table exists, has correct columns, and analyzes actual vector tables."""
    results = {'passed': False, 'issues': []}
    expected_columns = [
        'schema', 'table_name', 'total_size_bytes', 'vector_data_bytes', 'regular_data_bytes', 'vector_storage_pct', 'row_count'
    ]
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT EXISTS (
                    SELECT FROM information_schema.tables
                    WHERE table_name = 'vector_analysis_storage_consumption'
                );
            """)
            if not cur.fetchone()[0]:
                results['issues'].append("vector_analysis_storage_consumption table not found")
                return results

            cur.execute("""
                SELECT column_name FROM information_schema.columns
                WHERE table_name = 'vector_analysis_storage_consumption'
                ORDER BY column_name;
            """)
            actual_columns = {row[0] for row in cur.fetchall()}
            missing = set(expected_columns) - actual_columns
            extra = actual_columns - set(expected_columns)
            if missing:
                results['issues'].append(f"Missing columns: {missing}")
            if extra:
                results['issues'].append(f"Unexpected columns: {extra}")

            cur.execute("SELECT COUNT(*) FROM vector_analysis_storage_consumption;")
            count = cur.fetchone()[0]
            if count == 0:
                results['issues'].append("No rows found in vector_analysis_storage_consumption")
                return results

            # Get actual tables with vector columns
            cur.execute("""
                SELECT DISTINCT table_name
                FROM information_schema.columns
                WHERE data_type = 'USER-DEFINED'
                AND udt_name = 'vector'
                ORDER BY table_name;
            """)
            actual_vector_tables = {row[0] for row in cur.fetchall()}

            # Get what the agent analyzed
            cur.execute("""
                SELECT DISTINCT table_name
                FROM vector_analysis_storage_consumption
                ORDER BY table_name;
            """)
            analyzed_tables = {row[0] for row in cur.fetchall()}

            # Check if agent analyzed the actual vector tables
            missing_tables = actual_vector_tables - analyzed_tables
            if missing_tables:
                results['issues'].append(f"Agent missed analyzing vector tables: {missing_tables}")

            # Check that analyzed tables actually have vector columns
            extra_tables = analyzed_tables - actual_vector_tables
            if extra_tables:
                results['issues'].append(f"Agent analyzed non-vector tables: {extra_tables}")

            if not missing and not extra and count > 0 and not missing_tables and not extra_tables:
                results['passed'] = True

    except psycopg2.Error as e:
        results['issues'].append(f"Database error: {e}")
    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
    return results


def verify_vector_analysis_indices(conn) -> Dict[str, Any]:
    """Verify the vector_analysis_indices table exists, has correct columns, and identifies actual vector indexes."""
    results = {'passed': False, 'issues': []}
    expected_columns = [
        'schema', 'table_name', 'column_name', 'index_name', 'index_type', 'index_size_bytes'
    ]
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT EXISTS (
                    SELECT FROM information_schema.tables
                    WHERE table_name = 'vector_analysis_indices'
                );
            """)
            if not cur.fetchone()[0]:
                results['issues'].append("vector_analysis_indices table not found")
                return results

            cur.execute("""
                SELECT column_name FROM information_schema.columns
                WHERE table_name = 'vector_analysis_indices'
                ORDER BY column_name;
            """)
            actual_columns = {row[0] for row in cur.fetchall()}
            missing = set(expected_columns) - actual_columns
            extra = actual_columns - set(expected_columns)
            if missing:
                results['issues'].append(f"Missing columns: {missing}")
            if extra:
                results['issues'].append(f"Unexpected columns: {extra}")

            cur.execute("SELECT COUNT(*) FROM vector_analysis_indices;")
            count = cur.fetchone()[0]
            if count == 0:
                results['issues'].append("No rows found in vector_analysis_indices")
                return results

            # Get actual vector indexes from the database (exclude ground truth table indexes)
            cur.execute("""
                SELECT schemaname, tablename, indexname
                FROM pg_indexes
                WHERE (indexdef ILIKE '%hnsw%' OR indexdef ILIKE '%ivfflat%')
                AND tablename NOT LIKE '%analysis%'
                ORDER BY tablename, indexname;
            """)
            actual_vector_indexes = set(cur.fetchall())

            # Get what the agent found
            cur.execute("""
                SELECT schema, table_name, index_name
                FROM vector_analysis_indices
                ORDER BY table_name, index_name;
            """)
            found_indexes = set(cur.fetchall())

            # Check if agent found the actual vector indexes
            missing_indexes = actual_vector_indexes - found_indexes
            if missing_indexes:
                results['issues'].append(f"Agent missed vector indexes: {missing_indexes}")

            # Allow agent to find more indexes than just vector ones (they might include related indexes)
            # but at least they should find the vector-specific ones

            if not missing and not extra and count > 0 and not missing_indexes:
                results['passed'] = True

    except psycopg2.Error as e:
        results['issues'].append(f"Database error: {e}")
    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
    return results


def verify_no_extra_analysis_tables(conn) -> Dict[str, Any]:
    """Check that only the required analysis tables exist (no legacy/extra analysis tables)."""
    results = {'passed': True, 'issues': []}  # Start with passed=True, more lenient
    required = {
        'vector_analysis_columns',
        'vector_analysis_storage_consumption',
        'vector_analysis_indices',
    }
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT table_name FROM information_schema.tables
                WHERE table_schema = 'public'
                AND table_name LIKE 'vector_analysis_%';
            """)
            analysis_tables = {row[0] for row in cur.fetchall()}

            # Only flag as issue if there are analysis tables that don't match our required set
            # Exclude ground truth tables from this check
            analysis_tables_filtered = {t for t in analysis_tables if not t.startswith('expected_') and not t.startswith('vector_analysis_results')}
            extra = analysis_tables_filtered - required
            if extra:
                results['issues'].append(f"Found unexpected analysis tables: {extra}")
                results['passed'] = False

    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
        results['passed'] = False
    return results



def main():
    """Main verification function for vector analysis deliverables."""

    conn_params = get_connection_params()
    if not conn_params["database"]:
        print("No database specified")
        sys.exit(1)
    try:
        conn = psycopg2.connect(**conn_params)
        checks = [
            ("vector_analysis_columns", verify_vector_analysis_columns),
            ("vector_analysis_storage_consumption", verify_vector_analysis_storage_consumption),
            ("vector_analysis_indices", verify_vector_analysis_indices),
            ("no_extra_analysis_tables", verify_no_extra_analysis_tables),
        ]
        passed_checks = 0
        all_issues = []
        for i, (desc, check_func) in enumerate(checks, 1):
            result = check_func(conn)
            if result['passed']:
                print(f"[{i}/{len(checks)}] {desc}: PASSED")
                passed_checks += 1
            else:
                print(f"[{i}/{len(checks)}] {desc}: FAILED")
                for issue in result['issues']:
                    print(f"    - {issue}")
                all_issues.extend(result['issues'])
            print()
        conn.close()
        total_checks = len(checks)
        print(f"Results: {passed_checks}/{total_checks} checks passed")
        if passed_checks == total_checks:
            sys.exit(0)
        elif passed_checks >= total_checks * 0.75:
            sys.exit(0)
        else:
            sys.exit(1)
    except psycopg2.Error as e:
        print(f"Database connection error: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"Verification error: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()