DBA Vector Analysis

L3
Model Context Protocol · Postgres · Vectors

Analyze pgvector database storage, identify vector columns, assess space utilization and performance for RAG applications.

Created by Fanshi Zhang
2025-08-18
Performance Optimization · Audit And Compliance · Statistical Aggregation

Model Ranking

| Provider | Model | Pass@4 | Pass^4 | Avg Time | Avg Turns | Input Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|---|---|---|---|
| Claude | claude-sonnet-4 | 4/4 |  | 347.3s | 33.8 | 540,176 | 10,195 | 550,372 |
| Claude | claude-sonnet-4-high | 4/4 |  | 191.3s | 28.5 | 459,313 | 8,337 | 467,650 |
| Claude | claude-sonnet-4-low | 4/4 |  | 208.5s | 32.3 | 539,287 | 8,316 | 547,603 |
| OpenAI | gpt-5-high | 4/4 |  | 3027.4s | 39.8 | 516,150 | 109,050 | 625,200 |
| OpenAI | gpt-5-low | 4/4 |  | 753.8s | 26.0 | 607,676 | 46,510 | 654,186 |
| OpenAI | gpt-5-medium | 4/4 |  | 1225.2s | 35.5 | 459,682 | 59,914 | 519,596 |
| OpenAI | gpt-5-mini-high | 4/4 |  | 569.8s | 27.8 | 738,823 | 70,074 | 808,896 |
| MoonshotAI | kimi-k2-0711 | 4/4 |  | 216.2s | 31.3 | 227,560 | 4,688 | 232,248 |
| Qwen | qwen-3-coder-plus | 4/4 |  | 89.3s | 32.3 | 368,762 | 3,132 | 371,894 |
| Qwen | qwen-3-max | 4/4 |  | 91.3s | 29.0 | 283,119 | 2,325 | 285,445 |
| DeepSeek | deepseek-chat | 3/4 |  | 290.3s | 28.0 | 371,896 | 3,632 | 375,528 |
| OpenAI | gpt-4-1-mini | 3/4 |  | 99.1s | 21.5 | 217,754 | 4,218 | 221,971 |
| OpenAI | gpt-5-mini-medium | 3/4 |  | 355.6s | 24.8 | 356,677 | 21,827 | 378,504 |
| Grok | grok-4 | 3/4 |  | 207.9s | 33.8 | 310,551 | 7,345 | 317,896 |
| Grok | grok-code-fast-1 | 3/4 |  | 101.1s | 29.8 | 428,660 | 11,754 | 440,415 |
| MoonshotAI | kimi-k2-0905 | 3/4 |  | 289.9s | 25.0 | 223,687 | 4,074 | 227,760 |
| OpenAI | o3 | 3/4 |  | 104.8s | 25.0 | 117,575 | 5,932 | 123,507 |
| OpenAI | o4-mini | 3/4 |  | 166.6s | 9.0 | 45,541 | 11,066 | 56,607 |
| Gemini | gemini-2-5-flash | 2/4 |  | 155.8s | 26.3 | 278,232 | 24,911 | 303,143 |
| Z.ai | glm-4-5 | 2/4 |  | 174.0s | 25.0 | 332,272 | 5,824 | 338,095 |
| Gemini | gemini-2-5-pro | 1/4 |  | 32.1s | 2.5 | 4,981 | 2,439 | 7,420 |
| OpenAI | gpt-5-nano-low | 1/4 |  | 43.1s | 8.8 | 38,525 | 5,948 | 44,472 |
| OpenAI | gpt-5-nano-medium | 1/4 |  | 236.6s | 28.8 | 562,267 | 42,093 | 604,360 |
| Claude | claude-opus-4-1 | 0/1 | -- | 352.9s | 27.0 | 441,117 | 8,585 | 449,702 |
| OpenAI | gpt-4-1 | 0/4 |  | 73.5s | 13.3 | 477,884 | 2,657 | 480,541 |
| OpenAI | gpt-4-1-nano | 0/4 |  | 10.3s | 5.3 | 23,137 | 479 | 23,616 |
| OpenAI | gpt-5-mini-low | 0/4 |  | 77.1s | 7.8 | 40,154 | 7,366 | 47,520 |
| OpenAI | gpt-5-nano-high | 0/4 |  | 603.6s | 53.5 | 912,455 | 122,531 | 1,034,985 |
| OpenAI | gpt-oss-120b | 0/4 |  | 31.6s | 6.5 | 24,877 | 1,885 | 26,761 |

Task State

Table "documents" { "id" int4 [pk, not null, increment] "title" text [not null] "content" text [not null] "source_url" text "document_type" varchar(50) [default: 'article'] "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "updated_at" timestamp [default: `CURRENT_TIMESTAMP`] "word_count" int4 "embedding" public.vector Indexes { created_at [type: btree, name: "documents_created_idx"] embedding [type: hnsw, name: "documents_embedding_idx"] title [type: btree, name: "documents_title_idx"] document_type [type: btree, name: "documents_type_idx"] } } Table "document_chunks" { "id" int4 [pk, not null, increment] "document_id" int4 "chunk_index" int4 [not null] "chunk_text" text [not null] "chunk_size" int4 "overlap_size" int4 [default: 0] "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "embedding" public.vector Indexes { document_id [type: btree, name: "chunks_doc_id_idx"] embedding [type: hnsw, name: "chunks_embedding_idx"] chunk_index [type: btree, name: "chunks_index_idx"] } } Table "user_queries" { "id" int4 [pk, not null, increment] "query_text" text [not null] "user_id" varchar(100) "session_id" varchar(100) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "response_time_ms" int4 "embedding" public.vector Indexes { created_at [type: btree, name: "queries_created_idx"] embedding [type: hnsw, name: "queries_embedding_idx"] user_id [type: btree, name: "queries_user_idx"] } } Table "embedding_models" { "id" int4 [pk, not null, increment] "model_name" varchar(100) [unique, not null] "provider" varchar(50) [not null] "dimensions" int4 [not null] "max_tokens" int4 "cost_per_token" numeric(10,8) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "is_active" bool [default: true] } Table "knowledge_base" { "id" int4 [pk, not null, increment] "kb_name" varchar(100) [not null] "description" text "domain" varchar(50) "language" varchar(10) [default: 'en'] "total_documents" int4 [default: 0] "total_chunks" int4 [default: 0] "total_storage_mb" numeric(10,2) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "updated_at" timestamp [default: `CURRENT_TIMESTAMP`] } Table "search_cache" { "id" int4 [pk, not null, increment] "query_hash" varchar(64) [not null] "query_text" text [not null] "results_json" jsonb "result_count" int4 "search_time_ms" int4 "similarity_threshold" numeric(4,3) "created_at" timestamp [default: `CURRENT_TIMESTAMP`] "expires_at" timestamp Indexes { expires_at [type: btree, name: "cache_expires_idx"] query_hash [type: btree, name: "cache_hash_idx"] } } Ref "document_chunks_document_id_fkey":"documents"."id" < "document_chunks"."document_id" [delete: cascade]

Instruction

PostgreSQL Vector Database Analysis

Analyze and optimize a pgvector-powered database to understand storage patterns, performance characteristics, and data quality for embeddings in production workloads.

What's this about?

You've got a PostgreSQL database running the vector extension, storing embeddings for RAG (document similarity search), image recognition, or other ML workloads. Your job is to dive deep into this vector database and figure out what's going on under the hood. You need to understand:

  • how vectors are stored
  • how much space they're taking up
  • whether indexes are working properly
  • if there are any data quality issues lurking around

What you need to investigate

First, get familiar with what you're working with:

  • Check vector extension status: ensure it's installed properly, check the version, and identify any configuration issues
  • Identify all vector columns across the entire database: report each column, its data type, and its vector dimensions (a starting query is sketched after this list)
  • Map the vector landscape: understand the relationships between vector tables and regular tables, their foreign keys, and their dependencies
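
As a rough starting point, both checks can be driven from the system catalogs. This is a minimal sketch, not the full deliverable; the declared dimension still has to be pulled from pg_attribute, since pgvector stores it as the column's typmod:

SQL
-- Confirm the extension is installed and check its version.
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';

-- Find every vector column in the database.
SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE data_type = 'USER-DEFINED'
  AND udt_name = 'vector'
ORDER BY table_schema, table_name, column_name;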

Vectors can eat up a lot of storage, so let's see where the bytes are going:

  • Calculate vector storage overhead: measure how much space vectors take compared to regular columns in the same tables (see the sketch after this list)
  • Analyze table sizes: identify which vector tables are the biggest storage consumers, broken down by table
  • Understand growth patterns: examine record counts and project future storage needs based on current data
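
A hedged sketch of the per-table measurement, using pg_total_relation_size for the full on-disk footprint and pg_column_size summed over the embedding column for the vector share (documents and embedding come from the schema above; repeat per vector table):

SQL
SELECT pg_total_relation_size('documents')          AS total_size_bytes,
       COALESCE(SUM(pg_column_size(embedding)), 0)  AS vector_data_bytes,
       COUNT(*)                                     AS row_count
FROM documents;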

Vectors without proper indexes are painfully slow, so investigate:

  • Catalog vector indexes: find all HNSW and IVFFlat indexes and document their configurations and parameters (a catalog query is sketched after this list)
  • Measure index effectiveness: determine whether indexes are actually being used and helping query performance
  • Identify optimization opportunities: spot missing indexes, suboptimal configurations, and unused indexes
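
One way to pull that catalog is to join pg_index to the access-method catalog. The sketch below lists HNSW and IVFFlat indexes with their size and definition (build parameters such as m/ef_construction or lists appear in the definition string); pg_stat_user_indexes.idx_scan can be checked separately to see whether an index is actually used:

SQL
SELECT n.nspname                      AS schema,
       t.relname                      AS table_name,
       i.relname                      AS index_name,
       am.amname                      AS index_type,        -- 'hnsw' or 'ivfflat'
       pg_relation_size(i.oid)        AS index_size_bytes,
       pg_get_indexdef(ix.indexrelid) AS index_definition   -- includes build parameters
FROM pg_index ix
JOIN pg_class i     ON i.oid = ix.indexrelid
JOIN pg_class t     ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_am am       ON am.oid = i.relam
WHERE am.amname IN ('hnsw', 'ivfflat');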

Bad vector data makes everything worse:

  • Hunt for data issues: locate NULL vectors, dimension mismatches, and corrupted embeddings that could break queries (see the checks sketched after this list)
  • Validate consistency: ensure vectors in each column have consistent dimensions across all rows
  • Check for outliers: find vectors that might be skewing similarity calculations or causing performance issues
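
A hedged sketch of the quality checks for one table, relying on pgvector's vector_dims() and vector_norm() functions (documents is only the example; repeat per vector column):

SQL
SELECT COUNT(*) FILTER (WHERE embedding IS NULL)          AS null_vectors,
       COUNT(DISTINCT vector_dims(embedding))             AS distinct_dimensions,  -- should be exactly 1
       MIN(vector_dims(embedding))                        AS min_dimensions,
       MAX(vector_dims(embedding))                        AS max_dimensions,
       COUNT(*) FILTER (WHERE vector_norm(embedding) = 0) AS zero_norm_vectors     -- candidates for outlier review
FROM documents;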

Your deliverables

Create these analysis tables and populate them with your findings:

vector_analysis_columns

Complete catalog of every vector column you find:

SQL
CREATE TABLE vector_analysis_columns (
    schema VARCHAR(50),
    table_name VARCHAR(100),
    column_name VARCHAR(100),
    dimensions INTEGER,
    data_type VARCHAR(50),
    has_constraints BOOLEAN,
    rows BIGINT
);
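
One plausible way to populate it is an INSERT ... SELECT over the catalogs. Two assumptions in this sketch: has_constraints is approximated by the column's NOT NULL flag, and rows uses the planner's reltuples estimate (use COUNT(*) per table for exact counts):

SQL
INSERT INTO vector_analysis_columns
    (schema, table_name, column_name, dimensions, data_type, has_constraints, rows)
SELECT n.nspname,
       c.relname,
       a.attname,
       NULLIF(a.atttypmod, -1),               -- declared dimension; NULL if the column was created without one
       format_type(a.atttypid, a.atttypmod),  -- e.g. 'vector(1536)'
       a.attnotnull,                          -- crude stand-in for "has constraints"
       c.reltuples::bigint                    -- planner estimate
FROM pg_attribute a
JOIN pg_class c     ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_type t      ON t.oid = a.atttypid
WHERE t.typname = 'vector'
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND c.relkind = 'r';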

vector_analysis_storage_consumption

Show exactly where storage is being consumed:

SQL
CREATE TABLE vector_analysis_storage_consumption (
    schema VARCHAR(50),
    table_name VARCHAR(100),
    total_size_bytes BIGINT,
    vector_data_bytes BIGINT,
    regular_data_bytes BIGINT,
    vector_storage_pct NUMERIC(5,2),
    row_count BIGINT
);
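
A hedged population pattern for a single table (documents is only the example; repeat or UNION ALL over every vector table), deriving regular_data_bytes by subtraction and the percentage from the two byte counts:

SQL
INSERT INTO vector_analysis_storage_consumption
    (schema, table_name, total_size_bytes, vector_data_bytes, regular_data_bytes, vector_storage_pct, row_count)
SELECT 'public',
       'documents',
       pg_total_relation_size('documents'),
       COALESCE(SUM(pg_column_size(embedding)), 0),
       pg_total_relation_size('documents') - COALESCE(SUM(pg_column_size(embedding)), 0),
       ROUND(100.0 * COALESCE(SUM(pg_column_size(embedding)), 0)
             / NULLIF(pg_total_relation_size('documents'), 0), 2),
       COUNT(*)
FROM documents;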

vector_analysis_indices

Document all vector indexes and their characteristics:

SQL
CREATE TABLE vector_analysis_indices (
    schema VARCHAR(50),
    table_name VARCHAR(100),
    column_name VARCHAR(100),
    index_name VARCHAR(100),
    index_type VARCHAR(50), -- 'hnsw', 'ivfflat', etc.
    index_size_bytes BIGINT
);
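
And a matching population sketch that reuses the index query from earlier; recovering column_name from the index's first key attribute assumes single-column vector indexes, which matches the schema above:

SQL
INSERT INTO vector_analysis_indices
    (schema, table_name, column_name, index_name, index_type, index_size_bytes)
SELECT n.nspname,
       t.relname,
       a.attname,
       i.relname,
       am.amname,
       pg_relation_size(i.oid)
FROM pg_index ix
JOIN pg_class i     ON i.oid = ix.indexrelid
JOIN pg_class t     ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_am am       ON am.oid = i.relam
JOIN pg_attribute a ON a.attrelid = ix.indrelid AND a.attnum = ix.indkey[0]
WHERE am.amname IN ('hnsw', 'ivfflat');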

Use PostgreSQL system catalogs, pgvector-specific functions (such as vector_dims), and storage analysis functions to gather comprehensive metrics about the vector database implementation.



Verify

*.py
Python
"""
Verification script for Vector Database DBA Analysis task.

This script verifies that the candidate has properly analyzed the vector database
and stored their findings in appropriate result tables.
"""

import logging
import psycopg2
import os
import sys
from typing import Dict, Any

logger = logging.getLogger(__name__)


def get_connection_params():
    """Get database connection parameters from environment variables."""
    return {
        "host": os.getenv("POSTGRES_HOST", "localhost"),
        "port": int(os.getenv("POSTGRES_PORT", 5432)),
        "database": os.getenv("POSTGRES_DATABASE"),
        "user": os.getenv("POSTGRES_USERNAME"),
        "password": os.getenv("POSTGRES_PASSWORD"),
    }


def verify_vector_analysis_columns(conn) -> Dict[str, Any]:
    """Verify the vector_analysis_columns table exists, has correct columns, and contains actual vector columns from the database."""
    results = {'passed': False, 'issues': []}
    expected_columns = [
        'schema', 'table_name', 'column_name', 'dimensions', 'data_type', 'has_constraints', 'rows'
    ]
    try:
        with conn.cursor() as cur:
            # Check if table exists
            cur.execute("""
                SELECT EXISTS (
                    SELECT FROM information_schema.tables
                    WHERE table_name = 'vector_analysis_columns'
                );
            """)
            if not cur.fetchone()[0]:
                results['issues'].append("vector_analysis_columns table not found")
                return results

            # Check columns
            cur.execute("""
                SELECT column_name FROM information_schema.columns
                WHERE table_name = 'vector_analysis_columns'
                ORDER BY column_name;
            """)
            actual_columns = {row[0] for row in cur.fetchall()}
            missing = set(expected_columns) - actual_columns
            extra = actual_columns - set(expected_columns)
            if missing:
                results['issues'].append(f"Missing columns: {missing}")
            if extra:
                results['issues'].append(f"Unexpected columns: {extra}")

            # Check for data
            cur.execute("SELECT COUNT(*) FROM vector_analysis_columns;")
            count = cur.fetchone()[0]
            if count == 0:
                results['issues'].append("No rows found in vector_analysis_columns")
                return results

            # Get actual vector columns from the database
            cur.execute("""
                SELECT table_name, column_name
                FROM information_schema.columns
                WHERE data_type = 'USER-DEFINED'
                AND udt_name = 'vector'
                ORDER BY table_name, column_name;
            """)
            actual_vector_columns = set(cur.fetchall())

            # Get what the agent found
            cur.execute("""
                SELECT table_name, column_name
                FROM vector_analysis_columns
                ORDER BY table_name, column_name;
            """)
            found_vector_columns = set(cur.fetchall())

            # Check if agent found the actual vector columns
            missing_vectors = actual_vector_columns - found_vector_columns
            extra_vectors = found_vector_columns - actual_vector_columns

            if missing_vectors:
                results['issues'].append(f"Missing: {missing_vectors}")
            if extra_vectors:
                results['issues'].append(f"Non-existing: {extra_vectors}")

            if not missing and not extra and count > 0 and not missing_vectors and not extra_vectors:
                results['passed'] = True

    except psycopg2.Error as e:
        results['issues'].append(f"Database error: {e}")
    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
    return results


def verify_vector_analysis_storage_consumption(conn) -> Dict[str, Any]:
    """Verify the vector_analysis_storage_consumption table exists, has correct columns, and analyzes actual vector tables."""
    results = {'passed': False, 'issues': []}
    expected_columns = [
        'schema', 'table_name', 'total_size_bytes', 'vector_data_bytes', 'regular_data_bytes', 'vector_storage_pct', 'row_count'
    ]
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT EXISTS (
                    SELECT FROM information_schema.tables
                    WHERE table_name = 'vector_analysis_storage_consumption'
                );
            """)
            if not cur.fetchone()[0]:
                results['issues'].append("vector_analysis_storage_consumption table not found")
                return results

            cur.execute("""
                SELECT column_name FROM information_schema.columns
                WHERE table_name = 'vector_analysis_storage_consumption'
                ORDER BY column_name;
            """)
            actual_columns = {row[0] for row in cur.fetchall()}
            missing = set(expected_columns) - actual_columns
            extra = actual_columns - set(expected_columns)
            if missing:
                results['issues'].append(f"Missing columns: {missing}")
            if extra:
                results['issues'].append(f"Unexpected columns: {extra}")

            cur.execute("SELECT COUNT(*) FROM vector_analysis_storage_consumption;")
            count = cur.fetchone()[0]
            if count == 0:
                results['issues'].append("No rows found in vector_analysis_storage_consumption")
                return results

            # Get actual tables with vector columns
            cur.execute("""
                SELECT DISTINCT table_name
                FROM information_schema.columns
                WHERE data_type = 'USER-DEFINED'
                AND udt_name = 'vector'
                ORDER BY table_name;
            """)
            actual_vector_tables = {row[0] for row in cur.fetchall()}

            # Get what the agent analyzed
            cur.execute("""
                SELECT DISTINCT table_name
                FROM vector_analysis_storage_consumption
                ORDER BY table_name;
            """)
            analyzed_tables = {row[0] for row in cur.fetchall()}

            # Check if agent analyzed the actual vector tables
            missing_tables = actual_vector_tables - analyzed_tables
            if missing_tables:
                results['issues'].append(f"Agent missed analyzing vector tables: {missing_tables}")

            # Check that analyzed tables actually have vector columns
            extra_tables = analyzed_tables - actual_vector_tables
            if extra_tables:
                results['issues'].append(f"Agent analyzed non-vector tables: {extra_tables}")

            if not missing and not extra and count > 0 and not missing_tables and not extra_tables:
                results['passed'] = True

    except psycopg2.Error as e:
        results['issues'].append(f"Database error: {e}")
    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
    return results


def verify_vector_analysis_indices(conn) -> Dict[str, Any]:
    """Verify the vector_analysis_indices table exists, has correct columns, and identifies actual vector indexes."""
    results = {'passed': False, 'issues': []}
    expected_columns = [
        'schema', 'table_name', 'column_name', 'index_name', 'index_type', 'index_size_bytes'
    ]
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT EXISTS (
                    SELECT FROM information_schema.tables
                    WHERE table_name = 'vector_analysis_indices'
                );
            """)
            if not cur.fetchone()[0]:
                results['issues'].append("vector_analysis_indices table not found")
                return results

            cur.execute("""
                SELECT column_name FROM information_schema.columns
                WHERE table_name = 'vector_analysis_indices'
                ORDER BY column_name;
            """)
            actual_columns = {row[0] for row in cur.fetchall()}
            missing = set(expected_columns) - actual_columns
            extra = actual_columns - set(expected_columns)
            if missing:
                results['issues'].append(f"Missing columns: {missing}")
            if extra:
                results['issues'].append(f"Unexpected columns: {extra}")

            cur.execute("SELECT COUNT(*) FROM vector_analysis_indices;")
            count = cur.fetchone()[0]
            if count == 0:
                results['issues'].append("No rows found in vector_analysis_indices")
                return results

            # Get actual vector indexes from the database (exclude ground truth table indexes)
            cur.execute("""
                SELECT schemaname, tablename, indexname
                FROM pg_indexes
                WHERE (indexdef ILIKE '%hnsw%' OR indexdef ILIKE '%ivfflat%')
                AND tablename NOT LIKE '%analysis%'
                ORDER BY tablename, indexname;
            """)
            actual_vector_indexes = set(cur.fetchall())

            # Get what the agent found
            cur.execute("""
                SELECT schema, table_name, index_name
                FROM vector_analysis_indices
                ORDER BY table_name, index_name;
            """)
            found_indexes = set(cur.fetchall())

            # Check if agent found the actual vector indexes
            missing_indexes = actual_vector_indexes - found_indexes
            if missing_indexes:
                results['issues'].append(f"Agent missed vector indexes: {missing_indexes}")

            # Allow agent to find more indexes than just vector ones (they might include related indexes)
            # but at least they should find the vector-specific ones

            if not missing and not extra and count > 0 and not missing_indexes:
                results['passed'] = True

    except psycopg2.Error as e:
        results['issues'].append(f"Database error: {e}")
    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
    return results


def verify_no_extra_analysis_tables(conn) -> Dict[str, Any]:
    """Check that only the required analysis tables exist (no legacy/extra analysis tables)."""
    results = {'passed': True, 'issues': []}  # Start with passed=True, more lenient
    required = {
        'vector_analysis_columns',
        'vector_analysis_storage_consumption',
        'vector_analysis_indices',
    }
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT table_name FROM information_schema.tables
                WHERE table_schema = 'public'
                AND table_name LIKE 'vector_analysis_%';
            """)
            analysis_tables = {row[0] for row in cur.fetchall()}

            # Only flag as issue if there are analysis tables that don't match our required set
            # Exclude ground truth tables from this check
            analysis_tables_filtered = {t for t in analysis_tables if not t.startswith('expected_') and not t.startswith('vector_analysis_results')}
            extra = analysis_tables_filtered - required
            if extra:
                results['issues'].append(f"Found unexpected analysis tables: {extra}")
                results['passed'] = False

    except Exception as e:
        results['issues'].append(f"Verification error: {e}")
        results['passed'] = False
    return results



def main():
    """Main verification function for vector analysis deliverables."""

    conn_params = get_connection_params()
    if not conn_params["database"]:
        print("No database specified")
        sys.exit(1)
    try:
        conn = psycopg2.connect(**conn_params)
        checks = [
            ("vector_analysis_columns", verify_vector_analysis_columns),
            ("vector_analysis_storage_consumption", verify_vector_analysis_storage_consumption),
            ("vector_analysis_indices", verify_vector_analysis_indices),
            ("no_extra_analysis_tables", verify_no_extra_analysis_tables),
        ]
        passed_checks = 0
        all_issues = []
        for i, (desc, check_func) in enumerate(checks, 1):
            result = check_func(conn)
            if result['passed']:
                print(f"[{i}/{len(checks)}] {desc}: PASSED")
                passed_checks += 1
            else:
                print(f"[{i}/{len(checks)}] {desc}: FAILED")
                for issue in result['issues']:
                    print(f"    - {issue}")
                all_issues.extend(result['issues'])
            print()
        conn.close()
        total_checks = len(checks)
        print(f"Results: {passed_checks}/{total_checks} checks passed")
        if passed_checks == total_checks:
            sys.exit(0)
        elif passed_checks >= total_checks * 0.75:
            sys.exit(0)
        else:
            sys.exit(1)
    except psycopg2.Error as e:
        print(f"Database connection error: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"Verification error: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()