Every week, I hear some version of this question from clients: "Should we build a data warehouse? Or is it a data lake we need? What even is a data mart?"
The answer is always the same: It depends on what problem you're actually trying to solve.
After helping dozens of clients navigate data architecture decisions, I've developed a practical framework. No buzzwords, no vendor pitches—just clear answers to help you choose what you actually need.
The 30-Second Summary
If you're short on time, here's the quick version:
- Data Mart: Small, focused datasets for specific teams or use cases. Start here.
- Data Warehouse: Structured, cleaned data for organization-wide analytics. Graduate here.
- Data Lake: Raw data in all formats for data scientists and ML. Only if you need it.
Most businesses need a data mart (or a few of them). Some need a warehouse. Few actually need a lake.
Let me explain why.
Defining the Three Architectures
Data Mart: The Focused Specialist
A data mart is a subset of data optimized for a specific business area or use case.
Think of it like: A specialized library—everything curated for one subject.
Characteristics:
- Usually 1-100 GB in size
- Structured and pre-aggregated
- Optimized for specific queries
- Serves one team or use case
- Fast to query, easy to understand
Example use cases:
- Marketing dashboard with campaign metrics
- Finance reports with revenue and expenses
- Sales team performance tracker
- Customer support analytics
Typical tools: PostgreSQL, MySQL, or a small BigQuery or Snowflake deployment
Data Warehouse: The Central Library
A data warehouse is a structured repository of cleaned, integrated data from across your organization.
Think of it like: A massive organized library with every book categorized perfectly.
Characteristics:
- Usually 100 GB to multiple TB
- Structured and modeled for analytics (dimensional or normalized)
- Historical data for trend analysis
- Serves entire organization
- Requires ETL pipelines to maintain
Example use cases:
- Cross-departmental analytics
- Executive dashboards pulling from multiple sources
- Historical trend analysis
- Regulatory reporting
Typical tools: Snowflake, BigQuery, Redshift, Azure Synapse
Data Lake: The Raw Archive
A data lake stores raw data in its native format—structured, semi-structured, and unstructured.
Think of it like: A giant warehouse of boxes. Everything is here, but you'll spend time finding and organizing what you need.
Characteristics:
- Often TBs to PBs in size
- Raw, unprocessed data
- All formats: JSON, CSV, images, videos, logs
- Serves data scientists and ML engineers
- Flexible but requires expertise
Example use cases:
- Machine learning training data
- IoT sensor data
- Log analysis and observability
- Advanced analytics by data scientists
Typical tools: S3 + Athena, Azure Data Lake, Databricks, Delta Lake
The Decision Framework
Here are the questions I ask clients to figure out what they actually need:
Question 1: How big is your data?
- Under 10 GB: Data mart is almost certainly enough
- 10 GB to 1 TB: Data mart or small warehouse
- 1 TB to 100 TB: Data warehouse territory
- 100 TB or more: You might need a data lake
Question 2: Who will use it?
- One team with specific needs: Data mart
- Multiple departments with reporting needs: Data warehouse
- Data scientists doing ML/experimentation: Data lake
- Mix of all three: Layered architecture (we'll get to this)
Question 3: What formats is your data in?
- Mostly tabular (databases, CSVs): Mart or warehouse
- JSON, APIs, structured logs: Warehouse with semi-structured support
- Images, videos, text documents, mixed formats: Data lake
Question 4: How real-time do you need it?
- Daily or weekly refreshes are fine: Mart or warehouse
- Near real-time (minutes): Modern warehouse (Snowflake, BigQuery)
- True real-time (seconds): Specialized streaming architecture
Question 5: What's your budget?
- Under 500 dollars per month: Data mart only
- 500 to 5000 dollars per month: Small warehouse or data mart
- 5000 to 50000 dollars per month: Full warehouse with proper ETL
- 50000 or more per month: You can afford anything
Real-World Cost Comparison
Let me give you actual numbers from recent projects:
Small Business E-commerce (Data Mart)
Setup:
- PostgreSQL on DigitalOcean
- dbt for transformations
- Metabase for dashboards
- 50 GB of data
Monthly cost: 125 dollars
What they got: Sales analytics, inventory reports, customer insights
Mid-Size SaaS Company (Data Warehouse)
Setup:
- Snowflake warehouse
- Fivetran for data ingestion
- dbt for transformations
- Looker for dashboards
- 2 TB of data
Monthly cost: 4200 dollars
What they got: Unified analytics across product, sales, marketing, finance
Enterprise Company (Lake + Warehouse)
Setup:
- AWS S3 data lake
- Snowflake warehouse
- Databricks for ML
- Airflow for orchestration
- 500 TB total
Monthly cost: 45000 dollars
What they got: Full data platform for analytics and machine learning
Notice the pattern? Most businesses don't need the enterprise setup. They need a good data mart or a modest warehouse.
The Trap: Building Too Much Too Soon
Here's a mistake I see constantly: Companies hear "data warehouse" at a conference and decide they need one. Then they spend six months and 100K dollars building something they never use.
The reality:
- Most companies have less data than they think
- Business users want answers, not infrastructure
- Complex systems rot without dedicated engineers
- Dashboards are what people actually want
My advice: Start with a data mart. Graduate to a warehouse when you outgrow it.
A Practical Migration Path
Here's the path I typically recommend:
Stage 1: Single Data Mart (Months 1-3)
Start with one focused use case—usually the one that's causing the most pain.
```sql
-- Example: A sales data mart
CREATE SCHEMA sales_mart;

CREATE TABLE sales_mart.daily_revenue (
    date DATE PRIMARY KEY,
    total_revenue DECIMAL(10,2),
    total_orders INTEGER,
    average_order_value DECIMAL(10,2),
    top_product_id INTEGER
);

-- Pre-aggregated for fast dashboard queries
CREATE TABLE sales_mart.customer_segments (
    customer_id INTEGER PRIMARY KEY,
    segment VARCHAR(50),
    lifetime_value DECIMAL(10,2),
    first_order_date DATE,
    last_order_date DATE,
    order_count INTEGER
);
```

Tools: PostgreSQL, dbt, Metabase
Cost: 100-500 dollars per month
Time to value: 2-4 weeks
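To make the pre-aggregation concrete, here's a minimal sketch of the nightly job that would fill `daily_revenue` from a raw orders table. It uses SQLite as a stand-in for PostgreSQL, and the `orders` table and its sample rows are invented for illustration, not part of the schema above.

```python
import sqlite3

# Stand-in for the production database; in practice this is a PostgreSQL connection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_date TEXT, amount REAL, product_id INTEGER);
    INSERT INTO orders VALUES
        ('2026-01-01', 120.00, 1),
        ('2026-01-01', 80.00, 2),
        ('2026-01-02', 50.00, 1);
    CREATE TABLE daily_revenue (
        date TEXT PRIMARY KEY,
        total_revenue REAL,
        total_orders INTEGER,
        average_order_value REAL
    );
""")

# Nightly refresh: aggregate raw orders into the mart table once,
# so every dashboard query after that is a cheap primary-key lookup.
conn.execute("""
    INSERT INTO daily_revenue
    SELECT order_date,
           SUM(amount),
           COUNT(*),
           ROUND(SUM(amount) * 1.0 / COUNT(*), 2)
    FROM orders
    GROUP BY order_date
""")

rows = conn.execute("SELECT * FROM daily_revenue ORDER BY date").fetchall()
print(rows)  # [('2026-01-01', 200.0, 2, 100.0), ('2026-01-02', 50.0, 1, 50.0)]
```

In production the same SELECT lives in a dbt model and the scheduler just reruns it; the point is that the expensive aggregation happens once per refresh, not once per dashboard view.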
Stage 2: Multiple Data Marts (Months 4-9)
As other teams see value, build focused marts for them too.
```
your_database/
├── sales_mart/      (for sales team)
├── marketing_mart/  (for marketing team)
├── finance_mart/    (for finance team)
└── product_mart/    (for product team)
```
Key insight: Each mart can stay simple. They don't need to share schemas or complex joins.
Stage 3: Connected Warehouse (Months 10-18)
When teams need to combine data across marts, introduce a proper warehouse with dimensional modeling.
```sql
-- Unified dimensions shared across the warehouse
CREATE TABLE dim_customer (
    customer_key BIGINT PRIMARY KEY,
    customer_id VARCHAR(50),
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    email VARCHAR(255),
    segment VARCHAR(50),
    acquisition_channel VARCHAR(100),
    lifetime_value DECIMAL(10,2)
);

-- Fact tables reference shared dimensions
CREATE TABLE fact_orders (
    order_key BIGINT PRIMARY KEY,
    customer_key BIGINT REFERENCES dim_customer,
    product_key BIGINT REFERENCES dim_product,
    date_key INTEGER REFERENCES dim_date,
    order_amount DECIMAL(10,2),
    quantity INTEGER
);
```

Tools: Snowflake or BigQuery, dbt, Fivetran, Looker
Cost: 2000-8000 dollars per month
Time to value: 3-6 months
Stage 4: Add a Lake (Month 18+)
Only add a data lake when you have specific needs: ML training data, unstructured content, massive scale.
```
# Data lake organization
s3://company-data-lake/
├── raw/               # Untouched source data
│   ├── salesforce/
│   ├── stripe/
│   └── product_events/
├── processed/         # Cleaned, partitioned
│   ├── customer_events/
│   └── transaction_data/
└── ml_training/       # Feature stores for ML
    ├── churn_prediction/
    └── recommendation_engine/
```

When this makes sense:
- You have machine learning in production
- You store images, videos, or documents
- You have real-time streaming data
- You have a dedicated data engineering team
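A common convention inside that layout is date partitioning (`dt=YYYY-MM-DD` folders), so engines like Athena can prune whole partitions instead of scanning everything. This sketch writes events into that layout on local disk; the source name and event fields are hypothetical, and in practice the writes would go to S3.

```python
import json
import tempfile
from pathlib import Path

def write_event(root: Path, source: str, event: dict) -> Path:
    """Write one raw event into a dt=YYYY-MM-DD partition under its source folder."""
    partition = root / "raw" / source / f"dt={event['date']}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"{event['id']}.json"
    path.write_text(json.dumps(event))
    return path

lake = Path(tempfile.mkdtemp())  # stands in for s3://company-data-lake/
p = write_event(lake, "product_events",
                {"id": "evt-1", "date": "2026-01-15", "action": "signup"})
print(p.relative_to(lake))  # raw/product_events/dt=2026-01-15/evt-1.json
```

A query for one day then touches one folder, not the whole bucket, which is where most of the cost savings in a well-governed lake come from.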
Modern Stack Recommendations by Budget
Budget: Under 500 dollars per month (Small Business)
Pick: Data Mart
- Database: PostgreSQL (Supabase or DigitalOcean)
- Transformation: dbt Core (free)
- Dashboards: Metabase or Grafana (free)
- ETL: Airbyte Cloud or custom scripts
Budget: 500-5000 dollars per month (Growing Business)
Pick: Small Data Warehouse
- Database: Snowflake or BigQuery
- Transformation: dbt Cloud
- Dashboards: Metabase Pro or Mode
- ETL: Fivetran or Airbyte
Budget: 5000-50000 dollars per month (Enterprise)
Pick: Full Data Warehouse
- Database: Snowflake, BigQuery, or Redshift
- Transformation: dbt Cloud
- Dashboards: Looker or Tableau
- ETL: Fivetran + custom pipelines
- Orchestration: Airflow or Prefect
Budget: 50000 or more per month (Large Enterprise)
Pick: Lake + Warehouse Architecture
- Lake: AWS S3 or Azure Data Lake
- Warehouse: Snowflake or Databricks
- ML Platform: Databricks or SageMaker
- Orchestration: Airflow
- Governance: Collibra or Atlan
Common Mistakes to Avoid
Mistake 1: "We need all our data accessible"
No, you don't. You need answers. Accessible raw data often means overwhelming complexity.
Fix: Focus on the questions business users actually ask. Build for those.
Mistake 2: "Data lake because we might need it later"
Data lakes without clear governance become data swamps. Expensive, messy, and useless.
Fix: Don't build a lake until you have a specific ML or unstructured data use case.
Mistake 3: "Let's build the warehouse first, then the dashboards"
Business users lose patience waiting for infrastructure. They start pulling data into spreadsheets again.
Fix: Build vertical slices. One data mart that serves one dashboard. Then expand.
Mistake 4: "We'll figure out the schema later"
Unclear schemas create technical debt that compounds.
Fix: Use dimensional modeling from day one. Even for small marts.
Mistake 5: "Real-time everything!"
Real-time is expensive and rarely necessary. Most "real-time" requests mean "within 15 minutes."
Fix: Start with daily refreshes. Upgrade only when there's clear business value.
The Modern Data Stack in 2026
Here's what I'm using and recommending this year:
For Transformations
dbt (Data Build Tool)—Still the best way to transform data with SQL and version control. Works with everything.
For Ingestion
Fivetran for connected SaaS sources. Airbyte for custom or on-prem. Custom scripts for unique APIs.
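For the "custom scripts" case, the core of such a script is usually just flattening API JSON into warehouse rows. A minimal sketch, with an invented payload shape (a real API's field names will differ):

```python
def to_rows(payload: dict) -> list[tuple]:
    """Flatten a hypothetical /invoices API response into (id, email, amount) rows."""
    return [
        (item["id"], item["customer"]["email"], item["total_cents"] / 100)
        for item in payload["data"]
    ]

# Example response body from the imagined API
payload = {"data": [
    {"id": "inv_1", "customer": {"email": "a@example.com"}, "total_cents": 12050},
    {"id": "inv_2", "customer": {"email": "b@example.com"}, "total_cents": 900},
]}
rows = to_rows(payload)
print(rows)  # [('inv_1', 'a@example.com', 120.5), ('inv_2', 'b@example.com', 9.0)]
```

Everything around this function (auth, pagination, retries, loading) is plumbing; keeping the flattening logic pure like this makes it easy to unit-test before it ever touches the warehouse.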
For Storage
BigQuery if you're on Google Cloud and want serverless. Snowflake if you want flexibility and separation of compute and storage. Postgres if you're just starting out.
For Dashboards
Metabase for general business users. Looker for deeper exploration. Tableau for complex visualizations. Custom dashboards (React, Chart.js) for embedded analytics.
For Orchestration
Airflow for complex pipelines. Prefect or Dagster for modern alternatives. Simple cron jobs for small projects.
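At the cron-job end of that spectrum, a daily refresh can be one crontab line; the 2 a.m. schedule and the `/srv/analytics` project path here are placeholders, not a recommendation:

```
# Rebuild the mart's dbt models every night at 2 a.m.
0 2 * * * cd /srv/analytics && dbt run >> dbt.log 2>&1
```

Graduate to Airflow or Dagster only when you have dependencies between jobs that cron can't express.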
When to Bring in Experts
Build in-house when:
- You have clear internal use cases
- Your data team understands business needs
- Technology changes quickly in your industry
Bring in consultants when:
- You're setting up from scratch
- You need to migrate from legacy systems
- Your team is building for the first time
- You need specialized ML or real-time expertise
Decision Framework: Your Action Plan
Based on your situation, here's where to start:
If you're pre-data (everything in spreadsheets):
- Identify the ONE question that matters most
- Build a simple PostgreSQL database
- Pipe data in with a simple script or Airbyte
- Create a Metabase dashboard
- Total time: 2 weeks. Total cost: Under 200 dollars per month
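Step 3's "simple script" really can be simple. This sketch loads a CSV export into a database table, using SQLite as a stand-in for PostgreSQL (with a psycopg2 connection the pattern is identical); the filename, table, and columns are invented:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for a PostgreSQL connection in production
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")

# In the real script this would be open('orders_export.csv'); inlined here for clarity.
csv_data = io.StringIO("order_id,amount\nA-1,19.99\nA-2,5.00\n")
reader = csv.DictReader(csv_data)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(row["order_id"], float(row["amount"])) for row in reader],
)

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2
```

Run it on a schedule, point Metabase at the database, and you have the whole Stage 1 pipeline.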
If you have some data but no strategy:
- Audit what data you have and who uses it
- Identify your highest-value use case
- Build a focused data mart for that use case
- Expand to adjacent teams once it's working
If you have multiple teams wanting data:
- Don't try to unify everything at once
- Build separate data marts per team
- Share common dimensions (customer, product, etc.)
- Add a warehouse only when integration becomes painful
If you have ML ambitions:
- Start with a warehouse for analytics
- Prove ROI on basic analytics first
- Then add a lake for ML training data
- Dedicate engineering resources to maintain it
The Bottom Line
Data architecture is about solving business problems, not implementing fancy technology.
Most businesses need:
- One or two focused data marts
- Simple dashboards that answer real questions
- Basic ETL that runs daily or hourly
- Clear ownership and documentation
Most businesses don't need:
- A data lake
- Real-time streaming
- Complex multi-cluster architectures
- Data science platforms without data scientists
Start small. Solve one problem well. Expand when the value is clear.
The companies with the best data cultures aren't the ones with the fanciest infrastructure—they're the ones where business users get answers quickly and trust the data they see.
Resources
- dbt Learn - Free courses on modern data transformation
- The Analytics Engineering Guide
- Kimball Group - Dimensional modeling classics
- Modern Data Stack - Tool ecosystem overview
- A Practical Guide to Data Warehousing
Working through a data architecture decision? I help businesses navigate these choices every week. Drop a message through the contact form with your specific situation, and I'll share what's worked for similar companies.
What's your current data setup? What's the biggest problem you're trying to solve?