Data Engineer Things

Insights and ideas on data and engineering.

Minds and Machines — AI for Mental Health Support, Fine-Tuning LLMs with LoRA in Practice

Explore the potential of Large Language Models (LLMs) to change the future of mental healthcare, and learn how to fine-tune LLMs by example.

May 19

Trending Now

How Twitter processes 4 billion events in real-time daily

From Lambda to Kappa

May 25

I completed a Senior Data Engineer Code Challenge for fun, and this is how it went. PART II

Question: Using MySQL’s public employee sample database, create a DAG to move data from the employee’s table to BigQuery.

Apr 27

Pydantic for Experts: Reusing & Importing Validators

Advanced techniques for reusing and importing validators across Python models.

Yaakov Bressler

Apr 21

Speeding Up Power BI: The Case for Surrogate Keys in Dimensional Modeling

Exploring the transition from composite to surrogate keys for enhanced performance and maintainability in data warehousing.

Ivanna Ditlevsen Jurkiv

Apr 25

Why did Databricks build the Photon engine?

The Lakehouse, its motivation, and the difference between Photon and the existing engine.

Apr 6

I completed a Senior Data Engineer Code Challenge for fun, and this is how it went.

Question: Using MySQL's public employee sample database, create a DAG to move data from the employee's table to BigQuery. There should also…

Mar 2

Latest stories

Bloom Filter: In-Short

Set.contains() at scale with some False Positives

May 30

Test Driven Development for Data Engineering (Part 1)

How to write unit tests for data engineering

Yaakov Bressler

May 28

Real-Time Data Processing: Spark Streaming vs. Flink

Choosing the right tool for handling big data in real-time

Steffi Christopher

May 28

The Hadoop Distributed File System

Everything you need to know about the HDFS

May 24

Granular Look at Left, Semi, and Anti Joins in PySpark

In data operations, understanding the inner workings of the various types of joins can improve query performance and accuracy. Spark…

Nicholas Piesco

May 20

Understanding Snowflake Table Locks

A hands-on look at table locks.

May 16

Automate Dbt Date Logic with Python — Part 2

Simplifying Our Models and Tests From Part 1 Using Meta Config

May 14

The Inheritance Schema Design Pattern for MongoDB Data Modelling

In the world of NoSQL databases, particularly MongoDB, designing an efficient data model is crucial for optimal application performance…

May 12

How I build an ETL pipeline with AWS Glue, Lambda, and Terraform

A Step-by-Step Guide

May 12

Enhance your data quality tests with the dataform-assertions package

dbt is no longer the only choice for testing data pipelines

Fumiaki Kobayashi

May 12

My Data Pipeline Orchestrators Journey

Originally Posted at: www.junaideffendi.com

May 5

I spent 5 hours understanding more about the Delta Lake table format

All insights from the paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

May 4

What is something we have but don’t own and is never working when you need it.

Testing is difficult but pains could be eased with unified tooling. Here we explore the pros and cons of testing with new tools to help us

May 2

Installing (and Switching between) Different Versions of Python

How to install and switch between different Python versions.

Yaakov Bressler

May 1

How We Integrate 1000++ Hive Tables into Data Warehouse Without ETL Seamlessly

Migrating our data warehouse to Greenplum enables us to access data from Hive in real time, eliminate storage issues, and much more!

Bernard Adhitya

Apr 26

Why Stream Processing is a Terrible Market (Yet We Are Still Investing in It)

TL;DR: It’s a challenging market, yet it holds promising prospects.

Apr 25

Introduction to Apache Iceberg | PySpark

The Story Behind a Data Lake

Apr 25

Create your own Gemini AI-chatbot with a twist using Python, Jinja2 and NiceGUI

Discover the basics of using Gemini with Python via VertexAI, creating a Web UI with NiceGUI and using Jinja2 to construct modular prompts

Apr 25

Do We Need the Lakehouse Architecture?

When data lakes and data warehouses are not enough.

Apr 20

AWS Glue: essential tips for enhancing ETL development and operations

Explore 12 essential tips for Data Engineers and ETL Developers using AWS Glue

Apr 17

Best Practices for Writing Maintainable and Testable Spark Code in Scala

Enhancing Scalability and Reliability Through Structured Spark Development Practices

Thomas Cardenas

Apr 17

How to think about Internal Data Products as a Data Engineer

Data Products are all the rage, but why?

Apr 17

A Closer Look Into Databricks’s Photon Engine

Part 2 of Databricks’s Photon paper note: Vectorization

Apr 13

What the Heck is MLOps?

In summary, by the end of this blog, you should be able to explain to someone:

Apr 11

Memory Management in Apache Spark

Apache Spark’s performance advantage over MapReduce is greatest in the use-cases involving repeated computations. Much of this performance…

Apr 11
