```bash
# Clone and run — it's just Docker
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
./run-ab-platform.sh
```
This gives you a UI at localhost:8000 (default credentials: airbyte / password). Requirements: Docker Desktop with about 4 GB of RAM allocated.
Option A — File Source connector: Airbyte's File source reads CSV and Excel files directly, so spreadsheets need no conversion.
Option B — For XML specifically, convert to CSV first with a short script:
```python
# Quick XML → CSV for one-time load
import xml.etree.ElementTree as ET

import pandas as pd

tree = ET.parse('data.xml')
# Flatten into rows (assumes one element per record, one child per field)
rows = [{field.tag: field.text for field in record} for record in tree.getroot()]
df = pd.DataFrame(rows)
df.to_csv('data.csv', index=False)
```
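The flattening step can be sanity-checked end-to-end with an inline sample before pointing it at the real export (the tags below are illustrative, not the actual schema):

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Illustrative payload; a real export will have its own tags
SAMPLE = """
<schools>
  <school><name>Lincoln High</name><district>North</district></school>
  <school><name>Roosevelt Middle</name><district>South</district></school>
</schools>
"""

root = ET.fromstring(SAMPLE)
rows = [{field.tag: field.text for field in record} for record in root]
df = pd.DataFrame(rows)
print(df.to_csv(index=False))
```

If a record nests deeper than one level, flatten it explicitly before building the DataFrame; this one-liner only handles flat records.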
```sql
CREATE DATABASE shmoop_staging;
-- Then, connected to the new database, enable pg_trgm so SIMILARITY() works later:
-- CREATE EXTENSION IF NOT EXISTS pg_trgm;
```
Airbyte will auto-create a schema per source inside this database. Suggested sync schedules:
| Source | Schedule | Rationale |
|---|---|---|
| HubSpot | Every 6 hours | CRM changes frequently |
| Shmoop MySQL | Every 12 hours | School activity is less volatile |
| QuickBooks | Daily | Financials reconcile on day boundaries |
| XML/XLS | Manual trigger | One-time or as-needed |
This is where we solve the inconsistencies. Create a dbt project:
```bash
pip install dbt-postgres
dbt init shmoop_crm
```
```
shmoop_crm/
├── models/
│   ├── staging/            # Clean each source individually
│   │   ├── stg_hubspot_contacts.sql
│   │   ├── stg_hubspot_deals.sql
│   │   ├── stg_shmoop_schools.sql
│   │   ├── stg_quickbooks_customers.sql
│   │   ├── stg_quickbooks_invoices.sql
│   │   └── stg_file_imports.sql
│   ├── intermediate/       # Match & merge across sources
│   │   ├── int_matched_accounts.sql
│   │   └── int_matched_contacts.sql
│   └── marts/              # Final clean tables
│       ├── canonical_contacts.sql
│       ├── canonical_accounts.sql
│       └── canonical_financials.sql
└── tests/                  # Data quality checks
    ├── unique_email.sql
    ├── no_orphan_invoices.sql
    └── hubspot_quickbooks_match_rate.sql
```
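dbt also needs a connection profile before `dbt run` does anything. A minimal `~/.dbt/profiles.yml` for the Postgres staging database might look like this (host, user, password, and schema are placeholders for your setup):

```yaml
shmoop_crm:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost        # wherever the Airbyte destination Postgres lives
      port: 5432
      user: dbt_user         # placeholder
      password: change_me    # placeholder; use env_var() in practice
      dbname: shmoop_staging
      schema: analytics
      threads: 4
```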
```sql
-- models/intermediate/int_matched_accounts.sql
-- Note: SIMILARITY() requires the pg_trgm extension.
WITH hubspot AS (
    SELECT
        company_id,
        LOWER(TRIM(company_name)) AS name_clean,
        LOWER(TRIM(domain)) AS domain_clean,
        company_name AS original_name
    FROM {{ ref('stg_hubspot_companies') }}
),

quickbooks AS (
    SELECT
        customer_id,
        LOWER(TRIM(display_name)) AS name_clean,
        LOWER(TRIM(email_domain)) AS domain_clean,
        display_name AS original_name
    FROM {{ ref('stg_quickbooks_customers') }}
)

SELECT
    h.company_id AS hubspot_id,
    q.customer_id AS quickbooks_id,
    COALESCE(h.original_name, q.original_name) AS account_name,
    CASE
        WHEN h.domain_clean = q.domain_clean THEN 'domain_match'
        WHEN h.name_clean = q.name_clean THEN 'exact_name'
        WHEN SIMILARITY(h.name_clean, q.name_clean) > 0.7 THEN 'fuzzy_name'
        ELSE 'unmatched'
    END AS match_type
FROM hubspot h
FULL OUTER JOIN quickbooks q
    ON h.domain_clean = q.domain_clean
    OR SIMILARITY(h.name_clean, q.name_clean) > 0.7
```
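The same match-type cascade can be prototyped in Python before tuning the SQL, here using difflib's ratio as a rough stand-in for Postgres trigram SIMILARITY (the 0.7 threshold and field names mirror the model above; the helper itself is a sketch, not part of the pipeline):

```python
from difflib import SequenceMatcher


def clean(s):
    """Normalize the same way the staging models do: trim + lowercase."""
    return (s or "").strip().lower()


def match_type(hubspot, quickbooks, threshold=0.7):
    """Classify a HubSpot/QuickBooks pair the way int_matched_accounts does."""
    if clean(hubspot["domain"]) and clean(hubspot["domain"]) == clean(quickbooks["email_domain"]):
        return "domain_match"
    h_name = clean(hubspot["company_name"])
    q_name = clean(quickbooks["display_name"])
    if h_name == q_name:
        return "exact_name"
    # difflib's ratio is not trigram similarity, but it is close enough
    # to eyeball whether 0.7 is a sane cutoff for your actual names
    if SequenceMatcher(None, h_name, q_name).ratio() > threshold:
        return "fuzzy_name"
    return "unmatched"


print(match_type(
    {"company_name": "Acme Corp", "domain": "acme.com"},
    {"display_name": "ACME Corporation", "email_domain": "acme.com"},
))  # domain_match
```

Running a sample of real company names through this loop is the cheapest way to pick the threshold before it hardens into the dbt model.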
```sql
-- tests/hubspot_quickbooks_match_rate.sql
-- Fails if more than 20% of rows in int_matched_accounts are unmatched
SELECT
    COUNT(*) FILTER (WHERE match_type = 'unmatched')::float
        / COUNT(*)::float AS unmatched_rate
FROM {{ ref('int_matched_accounts') }}
HAVING
    COUNT(*) FILTER (WHERE match_type = 'unmatched')::float
        / COUNT(*)::float > 0.20
```
Once the marts/ models are clean, push the canonical tables to wherever the CRM lives. To recap what each layer solves:
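If HubSpot stays the system of record, that push can be as small as posting rows from canonical_contacts to HubSpot's CRM v3 batch endpoint. The payload shape below follows HubSpot's public batch-create API; the helper and column names are assumptions about the mart, not part of the pipeline above:

```python
import json


def to_hubspot_batch(contacts):
    """Shape canonical_contacts rows into a HubSpot v3 batch-create payload.

    HubSpot's POST /crm/v3/objects/contacts/batch/create expects
    {"inputs": [{"properties": {...}}, ...]}; the keys on the left are
    assumed columns in the canonical_contacts mart.
    """
    return {
        "inputs": [
            {"properties": {
                "email": row["email"],
                "firstname": row["first_name"],
                "lastname": row["last_name"],
            }}
            for row in contacts
        ]
    }


payload = to_hubspot_batch([
    {"email": "jane@acme.com", "first_name": "Jane", "last_name": "Doe"},
])
print(json.dumps(payload))
# Then something like:
# requests.post("https://api.hubapi.com/crm/v3/objects/contacts/batch/create",
#               headers={"Authorization": f"Bearer {token}"}, json=payload)
```

Keeping the payload builder separate from the HTTP call makes it trivial to unit-test the mapping without touching the live CRM.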
| Problem | Solution |
|---|---|
| Inconsistent names/emails across sources | dbt staging models normalize formats |
| Same entity in HubSpot + QuickBooks with different IDs | Intermediate matching models create a canonical ID |
| Spreadsheet data doesn't match DB schemas | File source + staging transform standardizes it |
| No visibility into data quality | dbt tests catch issues before they propagate |
| Manual MCP pulls are ad-hoc | Airbyte syncs on schedule, repeatable |