Data pipelines.
Written plainly.
By an engineer.

I'm Arnau Villoro, a data engineer in Barcelona. I write about Python, dbt, AWS — and the quiet habits that keep production boring. 8 years of notes, 68 posts, zero filler.

Most read

Solving the problem with small files in the Data Lake

Small files are a common problem in data lakes: they can significantly degrade performance and increase costs. This in-depth guide explains the issues small files cause, how to determine an optimal file size, and how to manage file sizes with tools such as Apache Spark, AWS Athena, Delta Lake, and Apache Iceberg. It also covers strategies for tracking file sizes and partitioning tables to make data processing and storage more efficient.

New post · How to Be Nice to the Data Team
68 posts published
8 years writing
Crafting solutions. Delivering efficiency.
Latest: 2 days ago · DE

Latest writing.

Honest notes from production · updated monthly

See all 68 posts