Shmoop CRM Data Consolidation

Airbyte OSS Architecture — March 2026

Architecture Overview

┌─────────────┐
│ HubSpot     │──── Airbyte Connector (native) ─────┐
└─────────────┘                                     │
┌─────────────┐                                     │
│ Shmoop MySQL│──── Airbyte Connector (native) ────┐│
│ (school     │                                    ││
│  activity)  │                                    ││
└─────────────┘                                    ▼▼
                                            ┌────────────────┐
┌─────────────┐                             │ Postgres       │
│ QuickBooks  │──── Airbyte Connector ─────▶│ Staging        │
│ (financials)│         (native)            │ Warehouse      │
└─────────────┘                             │                │
┌─────────────┐                             │ raw_hubspot    │
│ XML/XLS     │──── Airbyte File/Custom ───▶│ raw_shmoop     │
│ Spreadsheets│         Source              │ raw_quickbooks │
└─────────────┘                             │ raw_files      │
                                            └───────┬────────┘
                                                    │
                                              dbt transforms
                                                    │
                                                    ▼
                                            ┌────────────────┐
                                            │ Clean CRM      │
                                            │ canonical_     │
                                            │   contacts     │
                                            │ canonical_     │
                                            │   accounts     │
                                            │ canonical_     │
                                            │   financials   │
                                            └────────────────┘

Step 1: Deploy Airbyte OSS

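# Install abctl (Airbyte's CLI), then deploy locally; Docker must be running
curl -LsfS https://get.airbyte.com | bash -
abctl local install

This gives you a UI at localhost:8000. Print the generated login credentials with abctl local credentials. The older run-ab-platform.sh Docker Compose script in the airbyte repository is deprecated and is no longer the supported install path.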

Requirements: Docker Desktop with ~4GB RAM allocated.

Step 2: Configure Sources

HubSpot (native connector)

Shmoop MySQL (native connector)

QuickBooks (native connector)

XML/XLS Files (File or custom source)

Option A — File Source connector:

Option B — For XML specifically:

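# Quick XML → CSV for one-time load
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
# Assumes each child of the root is one record whose sub-elements are flat fields;
# adjust the traversal if the files are nested differently.
rows = [{field.tag: field.text for field in record} for record in tree.getroot()]
df = pd.DataFrame(rows)
df.to_csv('data.csv', index=False)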

Step 3: Configure Destination

Postgres Staging Warehouse

CREATE DATABASE shmoop_staging;

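With each connection's destination namespace set to a custom format, Airbyte creates one schema per source in shmoop_staging: raw_hubspot, raw_shmoop, raw_quickbooks, and raw_files.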

Step 4: Sync Schedules

Source         Schedule         Rationale
HubSpot        Every 6 hours    CRM changes frequently
Shmoop MySQL   Every 12 hours   School activity is less volatile
QuickBooks     Daily            Financials reconcile on day boundaries
XML/XLS        Manual trigger   One-time or as-needed
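
The schedule can also be kept as data, which makes it easy to check which syncs are overdue. The sketch below is plain Python, not an Airbyte API: SCHEDULE_HOURS and due_sources are hypothetical names, and the intervals are copied from the schedule listed above.

```python
from datetime import datetime, timedelta

# Intervals copied from the sync schedule; None means manual trigger only.
SCHEDULE_HOURS = {
    "HubSpot": 6,
    "Shmoop MySQL": 12,
    "QuickBooks": 24,
    "XML/XLS": None,
}


def due_sources(last_synced, now):
    """Return the sources whose interval has elapsed since their last sync."""
    due = []
    for source, hours in SCHEDULE_HOURS.items():
        if hours is None:
            continue  # manual sources are never scheduled automatically
        last = last_synced.get(source)
        # A source that has never synced is always due.
        if last is None or now - last >= timedelta(hours=hours):
            due.append(source)
    return due


if __name__ == "__main__":
    now = datetime(2026, 3, 1, 12, 0)
    last_synced = {
        "HubSpot": now - timedelta(hours=7),
        "Shmoop MySQL": now - timedelta(hours=2),
        "QuickBooks": now - timedelta(hours=24),
    }
    print(due_sources(last_synced, now))
```

In Airbyte itself the schedule is set per connection in the UI; a helper like this is only useful for monitoring or documentation.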

Step 5: dbt Transforms (Reconciliation Layer)

This is where we solve the inconsistencies. Create a dbt project:

pip install dbt-postgres
dbt init shmoop_crm

Project Structure

shmoop_crm/
├── models/
│   ├── staging/           # Clean each source individually
│   │   ├── stg_hubspot_contacts.sql
│   │   ├── stg_hubspot_companies.sql
│   │   ├── stg_hubspot_deals.sql
│   │   ├── stg_shmoop_schools.sql
│   │   ├── stg_quickbooks_customers.sql
│   │   ├── stg_quickbooks_invoices.sql
│   │   └── stg_file_imports.sql
│   │
│   ├── intermediate/      # Match & merge across sources
│   │   ├── int_matched_accounts.sql
│   │   └── int_matched_contacts.sql
│   │
│   └── marts/             # Final clean tables
│       ├── canonical_contacts.sql
│       ├── canonical_accounts.sql
│       └── canonical_financials.sql
│
├── tests/                 # Data quality checks
│   ├── unique_email.sql
│   ├── no_orphan_invoices.sql
│   └── hubspot_quickbooks_match_rate.sql
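
dbt init creates the project skeleton but not the individual model files. As a convenience, the layout can be scaffolded with a short script. This is a sketch using only the standard library; scaffold is a hypothetical helper, and the file list follows the tree and includes stg_hubspot_companies.sql, which the matching model references.

```python
from pathlib import Path

# Folder -> empty SQL files to create, following the project structure.
LAYOUT = {
    "models/staging": [
        "stg_hubspot_contacts.sql",
        "stg_hubspot_companies.sql",
        "stg_hubspot_deals.sql",
        "stg_shmoop_schools.sql",
        "stg_quickbooks_customers.sql",
        "stg_quickbooks_invoices.sql",
        "stg_file_imports.sql",
    ],
    "models/intermediate": [
        "int_matched_accounts.sql",
        "int_matched_contacts.sql",
    ],
    "models/marts": [
        "canonical_contacts.sql",
        "canonical_accounts.sql",
        "canonical_financials.sql",
    ],
    "tests": [
        "unique_email.sql",
        "no_orphan_invoices.sql",
        "hubspot_quickbooks_match_rate.sql",
    ],
}


def scaffold(root):
    """Create empty model and test files under root; existing files are left alone."""
    created = []
    for folder, names in LAYOUT.items():
        directory = Path(root) / folder
        directory.mkdir(parents=True, exist_ok=True)
        for name in names:
            path = directory / name
            if not path.exists():
                path.touch()
                created.append(path)
    return created
```

Calling scaffold("shmoop_crm") from the directory where dbt init was run fills in the placeholders; running it a second time creates nothing new.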

Example: Matching HubSpot to QuickBooks

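-- models/intermediate/int_matched_accounts.sql

WITH hubspot AS (
    SELECT
        company_id,
        LOWER(TRIM(company_name)) AS name_clean,
        LOWER(TRIM(domain)) AS domain_clean,
        company_name AS original_name
    FROM {{ ref('stg_hubspot_companies') }}
),

quickbooks AS (
    SELECT
        customer_id,
        LOWER(TRIM(display_name)) AS name_clean,
        LOWER(TRIM(email_domain)) AS domain_clean,
        display_name AS original_name
    FROM {{ ref('stg_quickbooks_customers') }}
),

-- Postgres rejects a FULL OUTER JOIN whose condition uses OR or SIMILARITY()
-- ("FULL JOIN is only supported with merge-joinable or hash-joinable join
-- conditions"), so find the matched pairs first with an inner join.
matched AS (
    SELECT
        h.company_id AS hubspot_id,
        q.customer_id AS quickbooks_id,
        h.original_name AS account_name,
        CASE
            WHEN h.domain_clean = q.domain_clean THEN 'domain_match'
            WHEN h.name_clean = q.name_clean THEN 'exact_name'
            ELSE 'fuzzy_name'
        END AS match_type
    FROM hubspot h
    INNER JOIN quickbooks q
        ON h.domain_clean = q.domain_clean
        OR SIMILARITY(h.name_clean, q.name_clean) > 0.7
)

SELECT hubspot_id, quickbooks_id, account_name, match_type
FROM matched

UNION ALL

-- HubSpot companies with no QuickBooks counterpart
SELECT h.company_id, NULL, h.original_name, 'unmatched'
FROM hubspot h
WHERE NOT EXISTS (SELECT 1 FROM matched m WHERE m.hubspot_id = h.company_id)

UNION ALL

-- QuickBooks customers with no HubSpot counterpart
SELECT NULL, q.customer_id, q.original_name, 'unmatched'
FROM quickbooks q
WHERE NOT EXISTS (SELECT 1 FROM matched m WHERE m.quickbooks_id = q.customer_id)

Note: SIMILARITY() comes from the pg_trgm extension. Enable it once per database with CREATE EXTENSION IF NOT EXISTS pg_trgm;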
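
To get a feel for what the 0.7 threshold means, the sketch below approximates how pg_trgm scores two strings: each word is lowercased, padded with two leading spaces and one trailing space, cut into three-character windows, and the score is the number of shared trigrams divided by the number of distinct trigrams overall. This is an illustrative Python reimplementation based on the pg_trgm documentation, not the extension itself, so edge cases (non-ASCII characters in particular) may differ. The function names are chosen here for the example.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, following pg_trgm's padding rule."""
    result = set()
    # Runs of non-alphanumeric characters separate words (ASCII only in this sketch).
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    pairs = [
        ("lincoln high school", "lincoln high school"),
        ("lincoln high school", "lincoln hs"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

Names that differ only by an abbreviation can score well below 0.7, so it is worth reviewing rows near the threshold before trusting the fuzzy_name matches.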

Example: Data Quality Test

-- tests/hubspot_quickbooks_match_rate.sql
-- Fails (returns a row) if more than 20% of QuickBooks customers have no HubSpot match

SELECT
    COUNT(*) FILTER (WHERE match_type = 'unmatched')::float
    / NULLIF(COUNT(*), 0) AS unmatched_rate
FROM {{ ref('int_matched_accounts') }}
-- Only count rows that represent a QuickBooks customer
WHERE quickbooks_id IS NOT NULL
HAVING
    COUNT(*) FILTER (WHERE match_type = 'unmatched')::float
    / NULLIF(COUNT(*), 0) > 0.20
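
The same threshold logic can be prototyped outside dbt while tuning the matching rules. A minimal sketch, assuming the rows of int_matched_accounts have been fetched as dictionaries with quickbooks_id and match_type keys; unmatched_rate and passes_match_rate are hypothetical helpers, not part of dbt.

```python
def unmatched_rate(rows):
    """Share of QuickBooks customers with no HubSpot match; None if there are none."""
    quickbooks_rows = [r for r in rows if r["quickbooks_id"] is not None]
    if not quickbooks_rows:
        return None
    unmatched = sum(1 for r in quickbooks_rows if r["match_type"] == "unmatched")
    return unmatched / len(quickbooks_rows)


def passes_match_rate(rows, max_unmatched=0.20):
    """True when at most max_unmatched of QuickBooks customers are unmatched."""
    rate = unmatched_rate(rows)
    return rate is None or rate <= max_unmatched


if __name__ == "__main__":
    sample = [
        {"quickbooks_id": 1, "match_type": "domain_match"},
        {"quickbooks_id": 2, "match_type": "fuzzy_name"},
        {"quickbooks_id": 3, "match_type": "unmatched"},
        {"quickbooks_id": None, "match_type": "unmatched"},  # HubSpot-only row
    ]
    print(unmatched_rate(sample), passes_match_rate(sample))
```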

Step 6: Push Clean Data to Destination

Once the marts/ models are clean, push to wherever the CRM lives:

What This Solves

Problem                                                  Solution
Inconsistent names/emails across sources                 dbt staging models normalize formats
Same entity in HubSpot + QuickBooks with different IDs   Intermediate matching models create a canonical ID
Spreadsheet data doesn't match DB schemas                File source + staging transform standardizes it
No visibility into data quality                          dbt tests catch issues before they propagate
Manual MCP pulls are ad-hoc                              Airbyte syncs on schedule, repeatable
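
As an illustration of the normalization the staging models perform, the sketch below mirrors the LOWER(TRIM(...)) cleanup in Python and derives a domain from an email address, which is one way a column like email_domain in the matching model could be populated. The function names are hypothetical, and mapping blanks to None is a choice made here so that empty values never match each other.

```python
def clean_text(value):
    """Roughly LOWER(TRIM(value)); None and blank strings both become None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def email_domain(email):
    """Return the normalized part after the last '@', or None if there is none."""
    cleaned = clean_text(email)
    if cleaned is None or "@" not in cleaned:
        return None
    domain = cleaned.rsplit("@", 1)[1]
    return domain or None


if __name__ == "__main__":
    print(clean_text("  Lincoln High School "))
    print(email_domain(" Billing@Lincoln-HS.org "))
```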

Cost / Effort

Key principle: Never transform in flight. Land the raw data first, then reconcile in a place where you can inspect, test, and iterate before pushing to the final destination.
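
The principle can be demonstrated end to end with an in-memory SQLite database standing in for Postgres. In this sketch the raw table is loaded exactly as received, and the cleaned view is derived from it afterward, so the original values stay available for inspection. The table, column, and function names are made up for the example.

```python
import sqlite3

# Sample input with inconsistent casing and whitespace, as it might arrive.
RAW_ROWS = [("  Ann@Example.com ",), ("ann@example.com",), ("BOB@example.com",)]


def build_demo():
    """Land raw rows untouched, then define the cleanup as a separate layer."""
    conn = sqlite3.connect(":memory:")
    # Step 1: land the data exactly as received.
    conn.execute("CREATE TABLE raw_contacts (email TEXT)")
    conn.executemany("INSERT INTO raw_contacts (email) VALUES (?)", RAW_ROWS)
    # Step 2: reconcile in a re-runnable layer on top of the raw table.
    conn.execute(
        "CREATE VIEW clean_contacts AS "
        "SELECT DISTINCT LOWER(TRIM(email)) AS email FROM raw_contacts"
    )
    return conn


if __name__ == "__main__":
    conn = build_demo()
    raw_count = conn.execute("SELECT COUNT(*) FROM raw_contacts").fetchone()[0]
    clean = [r[0] for r in conn.execute("SELECT email FROM clean_contacts ORDER BY email")]
    print(raw_count, "raw rows ->", clean)
```

Because the cleanup lives in a view rather than in the load step, a wrong rule can be corrected and re-run without re-extracting anything from the sources.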