```bash
# Clone and run — it's just Docker
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
./run-ab-platform.sh
```
This gives you a UI at localhost:8000 (default credentials: airbyte / password). Requirements: Docker Desktop with about 4 GB of RAM allocated.
Option A — File Source connector: Airbyte's File source reads CSV and Excel files directly, so spreadsheets need no conversion.
Option B — For XML specifically, convert to CSV first with a short script:
```python
# Quick XML → CSV for one-time load
import xml.etree.ElementTree as ET

import pandas as pd

tree = ET.parse('data.xml')
# Flatten into rows (assumes one element per record, one child per field)
rows = [{field.tag: field.text for field in record} for record in tree.getroot()]
df = pd.DataFrame(rows)
df.to_csv('data.csv', index=False)
```
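The flattening step can be sanity-checked end-to-end with an inline sample before pointing it at the real export (the tags below are illustrative, not the actual schema):

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Illustrative payload; a real export will have its own tags
SAMPLE = """
<schools>
  <school><name>Lincoln High</name><district>North</district></school>
  <school><name>Roosevelt Middle</name><district>South</district></school>
</schools>
"""

root = ET.fromstring(SAMPLE)
rows = [{field.tag: field.text for field in record} for record in root]
df = pd.DataFrame(rows)
print(df.to_csv(index=False))
```

If a record nests deeper than one level, flatten it explicitly before building the DataFrame; this one-liner only handles flat records.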
```sql
CREATE DATABASE shmoop_staging;
-- Then, connected to the new database, enable pg_trgm so SIMILARITY() works later:
-- CREATE EXTENSION IF NOT EXISTS pg_trgm;
```
Airbyte will auto-create a schema per source inside this database. Suggested sync schedules:
| Source | Schedule | Rationale |
|---|---|---|
| HubSpot | Every 6 hours | CRM changes frequently |
| Shmoop MySQL | Every 12 hours | School activity is less volatile |
| QuickBooks | Daily | Financials reconcile on day boundaries |
| XML/XLS | Manual trigger | One-time or as-needed |
This is where we solve the inconsistencies. Create a dbt project:
```bash
pip install dbt-postgres
dbt init shmoop_crm
```
```
shmoop_crm/
├── models/
│   ├── staging/            # Clean each source individually
│   │   ├── stg_hubspot_contacts.sql
│   │   ├── stg_hubspot_deals.sql
│   │   ├── stg_shmoop_schools.sql
│   │   ├── stg_quickbooks_customers.sql
│   │   ├── stg_quickbooks_invoices.sql
│   │   └── stg_file_imports.sql
│   ├── intermediate/       # Match & merge across sources
│   │   ├── int_matched_accounts.sql
│   │   └── int_matched_contacts.sql
│   └── marts/              # Final clean tables
│       ├── canonical_contacts.sql
│       ├── canonical_accounts.sql
│       └── canonical_financials.sql
└── tests/                  # Data quality checks
    ├── unique_email.sql
    ├── no_orphan_invoices.sql
    └── hubspot_quickbooks_match_rate.sql
```
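dbt also needs a connection profile before `dbt run` does anything. A minimal `~/.dbt/profiles.yml` for the Postgres staging database might look like this (host, user, password, and schema are placeholders for your setup):

```yaml
shmoop_crm:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost        # wherever the Airbyte destination Postgres lives
      port: 5432
      user: dbt_user         # placeholder
      password: change_me    # placeholder; use env_var() in practice
      dbname: shmoop_staging
      schema: analytics
      threads: 4
```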
```sql
-- models/intermediate/int_matched_accounts.sql
-- Note: SIMILARITY() requires the pg_trgm extension.
WITH hubspot AS (
    SELECT
        company_id,
        LOWER(TRIM(company_name)) AS name_clean,
        LOWER(TRIM(domain)) AS domain_clean,
        company_name AS original_name
    FROM {{ ref('stg_hubspot_companies') }}
),

quickbooks AS (
    SELECT
        customer_id,
        LOWER(TRIM(display_name)) AS name_clean,
        LOWER(TRIM(email_domain)) AS domain_clean,
        display_name AS original_name
    FROM {{ ref('stg_quickbooks_customers') }}
)

SELECT
    h.company_id AS hubspot_id,
    q.customer_id AS quickbooks_id,
    COALESCE(h.original_name, q.original_name) AS account_name,
    CASE
        WHEN h.domain_clean = q.domain_clean THEN 'domain_match'
        WHEN h.name_clean = q.name_clean THEN 'exact_name'
        WHEN SIMILARITY(h.name_clean, q.name_clean) > 0.7 THEN 'fuzzy_name'
        ELSE 'unmatched'
    END AS match_type
FROM hubspot h
FULL OUTER JOIN quickbooks q
    ON h.domain_clean = q.domain_clean
    OR SIMILARITY(h.name_clean, q.name_clean) > 0.7
```
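The same match-type cascade can be prototyped in Python before tuning the SQL, here using difflib's ratio as a rough stand-in for Postgres trigram SIMILARITY (the 0.7 threshold and field names mirror the model above; the helper itself is a sketch, not part of the pipeline):

```python
from difflib import SequenceMatcher


def clean(s):
    """Normalize the same way the staging models do: trim + lowercase."""
    return (s or "").strip().lower()


def match_type(hubspot, quickbooks, threshold=0.7):
    """Classify a HubSpot/QuickBooks pair the way int_matched_accounts does."""
    if clean(hubspot["domain"]) and clean(hubspot["domain"]) == clean(quickbooks["email_domain"]):
        return "domain_match"
    h_name = clean(hubspot["company_name"])
    q_name = clean(quickbooks["display_name"])
    if h_name == q_name:
        return "exact_name"
    # difflib's ratio is not trigram similarity, but it is close enough
    # to eyeball whether 0.7 is a sane cutoff for your actual names
    if SequenceMatcher(None, h_name, q_name).ratio() > threshold:
        return "fuzzy_name"
    return "unmatched"


print(match_type(
    {"company_name": "Acme Corp", "domain": "acme.com"},
    {"display_name": "ACME Corporation", "email_domain": "acme.com"},
))  # domain_match
```

Running a sample of real company names through this loop is the cheapest way to pick the threshold before it hardens into the dbt model.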
```sql
-- tests/hubspot_quickbooks_match_rate.sql
-- Fails if more than 20% of rows in int_matched_accounts are unmatched
SELECT
    COUNT(*) FILTER (WHERE match_type = 'unmatched')::float
        / COUNT(*)::float AS unmatched_rate
FROM {{ ref('int_matched_accounts') }}
HAVING
    COUNT(*) FILTER (WHERE match_type = 'unmatched')::float
        / COUNT(*)::float > 0.20
```
Once the marts/ models are clean, push the canonical tables to wherever the CRM lives. To recap what each layer solves:
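If HubSpot stays the system of record, that push can be as small as posting rows from canonical_contacts to HubSpot's CRM v3 batch endpoint. The payload shape below follows HubSpot's public batch-create API; the helper and column names are assumptions about the mart, not part of the pipeline above:

```python
import json


def to_hubspot_batch(contacts):
    """Shape canonical_contacts rows into a HubSpot v3 batch-create payload.

    HubSpot's POST /crm/v3/objects/contacts/batch/create expects
    {"inputs": [{"properties": {...}}, ...]}; the keys on the left are
    assumed columns in the canonical_contacts mart.
    """
    return {
        "inputs": [
            {"properties": {
                "email": row["email"],
                "firstname": row["first_name"],
                "lastname": row["last_name"],
            }}
            for row in contacts
        ]
    }


payload = to_hubspot_batch([
    {"email": "jane@acme.com", "first_name": "Jane", "last_name": "Doe"},
])
print(json.dumps(payload))
# Then something like:
# requests.post("https://api.hubapi.com/crm/v3/objects/contacts/batch/create",
#               headers={"Authorization": f"Bearer {token}"}, json=payload)
```

Keeping the payload builder separate from the HTTP call makes it trivial to unit-test the mapping without touching the live CRM.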
| Problem | Solution |
|---|---|
| Inconsistent names/emails across sources | dbt staging models normalize formats |
| Same entity in HubSpot + QuickBooks with different IDs | Intermediate matching models create a canonical ID |
| Spreadsheet data doesn't match DB schemas | File source + staging transform standardizes it |
| No visibility into data quality | dbt tests catch issues before they propagate |
| Manual MCP pulls are ad-hoc | Airbyte syncs on schedule, repeatable |