Data pipelines.
Written plainly.
By an engineer.

I'm Arnau Villoro, a data engineer in Barcelona. I write about Python, dbt, AWS — and the quiet habits that keep production boring. 8 years of notes, 68 posts, zero filler.

Solving the problem with small files in the Data Lake

This post tackles the small-files problem in data lakes, which degrades query performance and raises costs. It explains why small files hurt, how to choose a target file size, and how to manage file sizes with Apache Spark, Amazon Athena, Delta Lake, and Apache Iceberg. It also covers tracking file sizes and partitioning tables to keep processing and storage efficient.

New post · How to Be Nice to the Data Team

Crafting solutions. Delivering efficiency.

Latest writing.

Honest notes from production · updated monthly

Latest · ⚙️ DE · 05 May 2026 · 09 Mins read

How to Be Nice to the Data Team

A practical guide for product, engineering, ops, and business teams on how to design source systems and frame requests so the data team can deliver faster and more reliably.

Marimo notebooks for Python projects

🛠️ Tools/Utils

15 Apr 2026 · 06 Mins read

Recovering files from S3 using Delete Markers

⚙️ DE

17 Mar 2026 · 05 Mins read

Protecting Production Tables in dbt

⚙️ DE

17 Feb 2026 · 04 Mins read

Scalable GitHub Actions for Modern Repos

☁️ Cloud/DevOps

14 Jan 2026 · 04 Mins read

Scaling ECS Python Deployments with a Modular Monorepo

☁️ Cloud/DevOps

09 Dec 2025 · 04 Mins read

Fast and Reproducible Python Deployments on ECS with uv

☁️ Cloud/DevOps

25 Nov 2025 · 09 Mins read