Data Engineer Things
Insights and ideas on data and engineering.
Minds and Machines — AI for Mental Health Support, Fine-Tuning LLMs with LoRA in Practice
Explore the potential of Large Language Models (LLMs) to change the future of mental healthcare, and learn how to fine-tune LLMs by example
Volker Janz
May 19
Trending Now
How Twitter processes 4 billion events in real-time daily
From Lambda to Kappa
Vu Trinh
May 25
I completed a Senior Data Engineer Code Challenge for fun, and this is how it went. PART II
Question: Using MySQL’s public employee sample database, create a DAG to move data from the employee’s table to BigQuery.
Jennifer Ebe
Apr 27
Pydantic for Experts: Reusing & Importing Validators
Advanced techniques for reusing and importing validation across Python models.
Yaakov Bressler
Apr 21
Speeding Up Power BI: The Case for Surrogate Keys in Dimensional Modeling
Exploring the transition from composite to surrogate keys for enhanced performance and maintainability in data warehousing.
Ivanna Ditlevsen Jurkiv
Apr 25
Why did Databricks build the Photon engine?
The Lakehouse, its motivation, and the difference between Photon and the existing engine.
Vu Trinh
Apr 6
I completed a Senior Data Engineer Code Challenge for fun, and this is how it went.
Question: Using MySQL's public employee sample database, create a DAG to move data from the employee's table to BigQuery. There should also…
Jennifer Ebe
Mar 2
Latest stories
Bloom Filter: In-Short
Set.contains() at scale, with some false positives
Susmit
May 30
Test Driven Development for Data Engineering (Part 1)
How to write unit tests for data engineering
Yaakov Bressler
May 28
Real-Time Data Processing: Spark Streaming vs. Flink
Choosing the right tool for handling big data in real-time
Steffi Christopher
May 28
The Hadoop Distributed File System
Everything you need to know about HDFS
Vu Trinh
May 24
Granular Look at Left, Semi, and Anti Joins in PySpark
In data operations, understanding the inner workings of the various types of joins can improve query performance and accuracy. Spark…
Nicholas Piesco
May 20
Understanding Snowflake Table Locks
A hands-on look at table locks.
Jonathan Duran
May 16
Automate Dbt Date Logic with Python — Part 2
Simplifying Our Models and Tests From Part 1 Using Meta Config
Leo Godin
May 14
The Inheritance Schema Design Pattern for MongoDB Data Modelling
In the world of NoSQL databases, particularly MongoDB, designing an efficient data model is crucial for optimal application performance…
Karen Zhang
May 12
How I build an ETL pipeline with AWS Glue, Lambda, and Terraform
A Step-by-Step Guide
Lorena Gongang
May 12
Enhance your data quality tests with the dataform-assertions package
dbt is no longer the only choice for testing data pipelines
Fumiaki Kobayashi
May 12
My Data Pipeline Orchestrators Journey
Originally Posted at: www.junaideffendi.com
Junaid Effendi
May 5
I spent 5 hours understanding more about the Delta Lake table format
All insights from the paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
Vu Trinh
May 4
What is something we have but don’t own and is never working when you need it.
Testing is difficult, but the pain can be eased with unified tooling. Here we explore the pros and cons of testing with new tools to help us
Peter Flook
May 2
Installing (and Switching between) Different Versions of Python
How to install and switch between different Python versions.
Yaakov Bressler
May 1
How We Integrate 1000++ Hive Tables into Data Warehouse Without ETL Seamlessly
Migrating our data warehouse to Greenplum lets us access data from Hive in real time, eliminate storage issues, and much more!
Bernard Adhitya
Apr 26
Why Stream Processing is a Terrible Market (Yet We Are Still Investing in It)
TL;DR: It’s a challenging market, yet it holds promising prospects.
Yingjun Wu
Apr 25
Introduction to Apache Iceberg | PySpark
The Story Behind a Data Lake
Pavan Kumar
Apr 25
Create your own Gemini AI-chatbot with a twist using Python, Jinja2 and NiceGUI
Discover the basics of using Gemini with Python via VertexAI, creating a Web UI with NiceGUI and using Jinja2 to construct modular prompts
Volker Janz
Apr 25
Do We Need the Lakehouse Architecture?
When data lakes and data warehouses are not enough.
Vu Trinh
Apr 20
AWS Glue: essential tips for enhancing ETL development and operations
Explore 12 essential tips for Data Engineers and ETL Developers using AWS Glue
George Matheou
Apr 17
Best Practices for Writing Maintainable and Testable Spark Code in Scala
Enhancing Scalability and Reliability Through Structured Spark Development Practices
Thomas Cardenas
Apr 17
How to think about Internal Data Products as a Data Engineer
Data Products are all the rage, but why?
Hugo Lu
Apr 17
A Closer Look Into Databricks’s Photon Engine
Part 2 of Databricks’s Photon paper note: Vectorization
Vu Trinh
Apr 13
What the Heck is MLOps?
In summary, by the end of this blog, you should be able to explain to someone:
Ronan Walters
Apr 11
Memory Management in Apache Spark
Apache Spark’s performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance…
Solon Das
Apr 11