# Data Engineer Interview Prep

An 8-week structured practice track for data engineering interviews: weekly schedule, graded problem sets, self-assessment rubric, and a journal template.


Schedule · Weekly cycle · Scoring · Journal · Companion repos


Reading prep material is necessary but not sufficient. Doing problems on a schedule with self-assessment is what produces results. This repo is the schedule.

Six hours per week for eight weeks. Every problem links to a runnable browser sandbox at datadriven.io. No local setup.

## Schedule

| Week | Phase | Focus | Target avg score | Primer | Practice |
| --- | --- | --- | --- | --- | --- |
| 1 | Foundations | SQL fundamentals: joins, aggregation, dates | 1.5 | joins, aggregation | SQL bank, 9 problems |
| 2 | Foundations | Python data wrangling | 1.5 | foundations, collections | Python bank, 9 problems |
| 3 | Patterns | SQL window functions | 2.0 | window functions | Window drill, 9 problems |
| 4 | Patterns | Python: sessionization, dedup, retries, partitioning | 2.0 | Python for DE | 9 problems across patterns |
| 5 | Design | Schema design | 2.0 | data modeling lessons (read all 8) | 4 schema problems, sketched on paper |
| 6 | Design | Pipeline architecture | 2.0 | system design framework | 3 case studies, designed end to end |
| 7 | Polish | Mocks and behavioral story bank | 2.5 | none | One full mock loop, six STAR stories |
| 8 | Polish | Company-specific prep | 2.5 | Company guides | 90 minutes per target company |

## Weekly cycle

| Day | Activity | Time |
| --- | --- | --- |
| Mon | Read the focus primer | 30 min |
| Tue | Problem set 1 (3 easy/medium) | 60 min |
| Wed | Problem set 2 (3 medium) | 60 min |
| Thu | Problem set 3 (3 hard) | 90 min |
| Fri | Self-review and journal entry | 30 min |
| Sat | Long-form problem (schema or pipeline) | 90 min |
| Sun | Off | |

Total: about 6 hours per week. If you can only commit 4 hours, drop Thursday.

## Scoring

After each problem, score yourself 0 to 3:

| Score | Meaning |
| --- | --- |
| 0 | Could not start. Need to learn from scratch. |
| 1 | Solved with significant help. Need more reps. |
| 2 | Solved without help, but took longer than expected. |
| 3 | Solved cleanly within time. Could explain it to a colleague. |

Track scores in /journal/week-XX.md. Goal at the end of week 8: average 2.5.

If a week's average is below the target in the schedule, repeat the week before moving on.
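The average-and-gate rule above can be sketched in a few lines of Python. This is a hypothetical helper, not part of the repo; the scores below are made up for illustration.

```python
def week_average(scores):
    """Average of a week's 0-3 self-assessment scores."""
    return sum(scores) / len(scores)

def should_repeat(scores, target):
    """True if the weekly average falls below the schedule's target,
    meaning the week should be repeated before moving on."""
    return week_average(scores) < target

# Nine problems from a hypothetical week 3, each scored 0-3.
week3_scores = [2, 3, 1, 2, 2, 2, 3, 2, 1]
print(week_average(week3_scores))               # 2.0
print(should_repeat(week3_scores, target=2.0))  # False: move on
```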

## Journal

Every Friday, fill in the template:

```markdown
# Week XX

## Score summary
- SQL: avg X.X / 3
- Python: avg X.X / 3
- Schema: avg X.X / 3
- Pipeline: avg X.X / 3

## What I missed
- (concrete, problem by problem)

## What I will repeat next week
- (specific topics or patterns to drill again)

## One thing I learned that surprised me
- (1 to 2 sentences)
```

The journal is the program. Skip it and you are just doing problems.

## Week-by-week deep dives

### Week 1: SQL fundamentals

Drill 9 problems from datadriven.io/sql-interview-questions covering joins, aggregating, and date filtering. Start with 10 Lowest Uptime Services, 2FA Confirmation Rate, 30 Day Page View Counts.
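If you want to warm up locally before the sandbox, the Week 1 ingredients (a join, an aggregate, a date filter) fit in a few lines with Python's built-in `sqlite3`. The toy schema below is illustrative, not one of the linked problems.

```python
import sqlite3

# Toy schema, made up for this warm-up: users and their login days.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE logins (user_id INTEGER, day TEXT);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bob');
    INSERT INTO logins VALUES
        (1, '2024-01-05'), (1, '2024-01-20'), (2, '2024-02-01');
""")

# Join, date-filter to January, then aggregate per user.
rows = conn.execute("""
    SELECT u.name, COUNT(*) AS january_logins
    FROM users u
    JOIN logins l ON l.user_id = u.id
    WHERE l.day BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
print(rows)  # [('ana', 2)] -- bob only logged in during February
```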

### Week 2: Python wrangling

Work through Batch Records, Column Sum, Activity Time Ledger, Batch Partitioner, Batch With Metadata.

### Week 3: Window functions

Window functions show up in most senior SQL screens. Run the window functions drill timed. Target patterns: rolling totals, top N per group, sessionization, gaps and islands, percent of total, second to last X.
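One of the listed patterns, top N per group, can be sketched with Python's built-in `sqlite3` (window functions require SQLite 3.25+). The table and data are made up, not taken from the drill.

```python
import sqlite3

# Toy data: order amounts per customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ana', 50), ('ana', 90), ('ana', 70),
        ('bob', 20), ('bob', 80);
""")

# Top 2 orders per customer: rank within each partition, then filter.
rows = conn.execute("""
    SELECT customer, amount FROM (
        SELECT customer, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY customer ORDER BY amount DESC
               ) AS rn
        FROM orders
    ) WHERE rn <= 2
    ORDER BY customer, amount DESC
""").fetchall()
print(rows)  # [('ana', 90), ('ana', 70), ('bob', 80), ('bob', 20)]
```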

### Week 4: Python patterns

Sessionization, dedup, hash partitioning, interval merging, retries with backoff, schema evolution, top N with ties, parsing semi structured logs, streaming aggregation. Pick 9 from the Python bank covering each.
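As a taste of the sessionization pattern, here is a minimal sketch assuming a 30-minute inactivity gap; the event format and gap are illustrative, not the repo's exact problem spec.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Split timestamps into sessions: a new session starts whenever
    the time since the previous event exceeds `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: same session
        else:
            sessions.append([ts])     # gap exceeded: new session
    return sessions

events = [datetime(2024, 1, 1, 9, 0),
          datetime(2024, 1, 1, 9, 10),
          datetime(2024, 1, 1, 11, 0)]
print(len(sessionize(events)))  # 2 sessions: 9:00-9:10, then 11:00
```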

### Week 5: Schema design

Read all 8 data modeling lessons on Mon and Tue. Then sketch four schemas for 30 minutes each before reading the solution: A/B Experiment Assignment Schema, Customer Address History, Insurance Claims Lifecycle, Clickstream and Session Schema.

### Week 6: Pipeline architecture

Memorize the eight-beat framework. Sketch three pipelines end to end on paper, 45 minutes each: Card Transaction Streaming Pipeline, Cellular Connectivity and App Log Data Warehouse, and one from datadriven.io/data-pipeline-interview-questions matching your target company's stack.

### Week 7: Mocks and behavioral

90-minute mock loop on Saturday: 25 min SQL, 25 min Python, 30 min design, 10 min behavioral. No partner? Record yourself and watch the recording. Write six STAR stories during the week.

### Week 8: Company-specific prep

For each target company, 90 minutes: read the loop guide, read three recent engineering blog posts, pick three problems from this bank that match the company's style, map two STAR stories to the company's leveling rubric.

| Company | Guide |
| --- | --- |
| Netflix | companies/netflix/interview |
| Uber | companies/uber/interview |
| Amazon | companies/amazon/interview |
| Google | companies/google/interview |
| Meta | companies/meta/interview |

## When you finish

You should be able to:

  1. Write any window function from memory under time pressure.
  2. Implement any of the nine Python patterns without lookups.
  3. Defend a star schema with explicit grain and SCD choices.
  4. Walk through the eight-beat framework on a new pipeline question.
  5. Tell six STAR stories without rambling.
  6. Articulate the loop structure of your top three target companies.

If yes to all six, schedule the loop.

## Companion repos

## Contributing

The schedule is a starting point. If you have a better one with evidence, open a PR. Include your timeline, target role, starting level, and what worked.

## License

CC BY-SA 4.0. Sandboxes hosted at datadriven.io.
