
Homelab Journey: Building a Safe Practice Environment for Data Engineering Projects

Introduction

Why a Homelab?

Since I decided to pursue a career in Data Engineering, I knew I’d have to work with both on-premises (local) and cloud-provided tools. I learned how cloud providers make it much simpler to deal with infrastructure, but also how costly it can be if you make a mistake. After some research, I found out that many cloud tools are modified implementations of open-source projects, and I came across the idea of using Docker to host these tools locally.

Going further with my research, I found whole communities about homelabs and self-hosting. So, why not build my own practice cloud service? On this journey I would learn about infrastructure (way more than I thought initially), practice interacting with external services over the network, cause and deal with (lots of) errors, and understand the basics behind cloud provider abstractions. I just needed patience and (a lot of) time to learn.

So, why not? Right?

Choosing the Right Hardware

This topic gets a bit overwhelming once you dive deep into homelabs and servers. You will find lab hardware ranging from a single Raspberry Pi to multi-node Xeon clusters. Believe me, some people spend a lot on their setups. I’ve been iterating on this project for almost two years: I began with just an old spare laptop and then added a new node (my former gaming PC) with more robust specs.

Hardware Components

Node 01 - Laptop

  • Intel i7-6500U
  • 20 GB DDR4 RAM (4 GB + 16 GB)
  • 256 GB SATA SSD

Node 02 - Desktop

  • Intel i5-12400
  • 128 GB DDR4 RAM
  • 8 GB VRAM (GTX 1070 Ti)
  • 1 TB NVMe SSD + 512 GB SATA SSD

Considerations and Constraints

  • The tools you choose will determine how much power your hardware needs. For me, this laptop had enough.
  • Laptops are not meant to be servers; you will need some extra knowledge to deal with them, but not much.
  • Laptops are power efficient, and their battery works as a built-in UPS.
  • It’s challenging but also fun to deal with hardware. You must like it to follow the same path I did.
  • You will always be looking for something to upgrade, even if you don’t need to.

Setting Up the Homelab

My objective with this post is not to make a tutorial, but to give an overall view of the experience.

Choose your OS

There are several options on the market; I chose to install Proxmox on my first node (the laptop) and then Ubuntu Server on my second (the desktop).

  • Proxmox is a hypervisor; my understanding is that it is an OS made to host other OSs through virtual machines (VMs) or containers (LXC containers). It’s more DevOps-oriented and has many more settings to deal with. With it you create VMs and manage resources, network, and storage. On it I created VMs with Ubuntu Server, where I deployed my services.
  • Ubuntu Server is a Linux distro for servers. It works like any Linux, but you only interact with it via the terminal. It’s the same OS underneath many AWS EC2 instances.
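
To give a feel for the Proxmox side, VMs can also be created from its shell with the `qm` tool. This is only a sketch run on a hypothetical Proxmox host; the VM ID, storage name, and ISO path are placeholders I chose for illustration:

```shell
# Hypothetical example: create VM 100 with 2 cores, 4 GB RAM, and a 32 GB disk
# (assumes "local-lvm" storage and an Ubuntu ISO already uploaded to "local")
qm create 100 --name ubuntu-server --cores 2 --memory 4096 \
  --net0 virtio,bridge=vmbr0
qm set 100 --scsi0 local-lvm:32
qm set 100 --cdrom local:iso/ubuntu-24.04-live-server-amd64.iso
qm start 100
```

In practice the Proxmox web UI walks you through the same steps, which is how I did it at first.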

SSH

  • A server is not meant to be accessed directly. Make sure you can reach it over SSH (it’s how you access cloud VMs too).
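
The basic workflow can be sketched like this; the username and IP are placeholders, and the remote steps are commented out since they assume a reachable server:

```shell
# Generate a dedicated key pair for the homelab (no passphrase here for
# brevity; consider using one in practice)
ssh-keygen -t ed25519 -f ./homelab_key -N "" -q

# With a server up, copy the public key over and connect (hypothetical host):
# ssh-copy-id -i ./homelab_key.pub user@192.168.1.10
# ssh -i ./homelab_key user@192.168.1.10
```

Key-based login also lets you disable password authentication on the server, which is worth doing before exposing anything.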

Install Docker

  1. Install Docker and Docker Compose. Docker must be one of the most important tools I learned with this project, and it’s also the easiest way to deploy services.

Download Docker Images

  • Chose what services I wanted to host and pulled their Docker images.
  • Used docker-compose.yml files and Dockge to manage my stacks and containers (it’s an alternative to Portainer if you like using Docker Compose).
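
As an illustration, a minimal docker-compose.yml for one of these stacks might look like this; the ports, credentials, and volume path are placeholders I chose for the sketch:

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: change-me
    volumes:
      - ./minio-data:/data
    restart: unless-stopped
```

One file like this per stack is what Dockge manages; `docker compose up -d` brings the service up.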

Useful extras

  1. Installed a VPN so I could access the lab from outside my home network. I used Tailscale; with it you can communicate with your server node as if you were on your local network.
  2. Hosted a reverse proxy so I could access my services using a domain name, without having to remember the IP and port of every service I host. I use Nginx Proxy Manager.
  3. Cloudflare Tunnels allowed me to expose services to the internet in a safe way, like this blog post you are reading.

Challenges and Solutions

It was challenging to maintain focus on my main objective. I made the mistake of going too deep into infrastructure, and at some point I wasn’t practicing Data Engineering anymore; I was working more like a DevOps Engineer. Dealing with servers is a whole new knowledge stack, keep that in mind.

Self-Hosted Services

Most of the services hosted on my servers can also be found under the big cloud providers (AWS, Azure, and GCP).

Service 1: MinIO (object storage)

  • MinIO is an open-source, Amazon S3-compatible object storage service. It allows you to store and manage large amounts of unstructured data, such as images, videos, and files. With its S3 API compatibility, you can use MinIO as a drop-in replacement for Amazon S3.

Service 2: PostgreSQL (relational database)

  • PostgreSQL is an open-source relational database management system (RDBMS) that enables you to store and manage structured data.
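
For example, this is the kind of structured table a pipeline might load into; the schema and column names here are purely illustrative:

```sql
-- Illustrative schema for an orders table
CREATE TABLE IF NOT EXISTS orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    amount      NUMERIC(10, 2) NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT now()
);

-- A typical analytical query over it
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;
```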

Service 3: Apache Airflow (orchestrator)

  • Apache Airflow is an open-source platform for programmatically defining, scheduling, and monitoring workflows.
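
A workflow (DAG) in Airflow is just Python. A minimal sketch, assuming the `airflow` package is installed; the DAG ID, task names, and schedule are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data...")

def load():
    print("loading data...")

with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2                         # run extract before load
```

Dropping a file like this into Airflow’s `dags/` folder is enough for the scheduler to pick it up.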

Service 4: Apache Kafka (event streaming)

  • Apache Kafka is an open-source, distributed event streaming platform that enables you to publish, subscribe to, store, and process event-driven data at scale.
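
The publish/subscribe model looks roughly like this with the kafka-python client; the broker address and topic name are placeholders, and a running broker is assumed:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer: publish an event to a topic (hypothetical broker and topic)
producer = KafkaProducer(
    bootstrap_servers="192.168.1.10:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 1, "amount": 9.99})
producer.flush()

# Consumer: read events back from the same topic
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="192.168.1.10:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
```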

Service 5: Apache Spark (distributed analytics engine)

  • Apache Spark is an open-source analytics engine for large-scale data processing that provides high-level APIs in Java, Python, Scala, and R. It is designed to handle massive datasets, making it an ideal choice for big data processing and machine learning workloads.
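
A small PySpark sketch, assuming `pyspark` is installed and running in local mode rather than against a cluster; the toy dataset is made up for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("homelab-sketch")
    .master("local[*]")   # local mode; on a cluster this would point at the master
    .getOrCreate()
)

# Toy dataset standing in for a real table
df = spark.createDataFrame(
    [(1, "books", 12.50), (2, "games", 40.00), (3, "books", 7.99)],
    ["order_id", "category", "amount"],
)

# The same DataFrame API scales out unchanged on a real cluster
df.groupBy("category").agg(F.sum("amount").alias("total")).show()

spark.stop()
```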

Using My Homelab for Data Engineering Projects

Project 1: E-Commerce Streaming Data Pipelines

Project 2: Loading Structured Data into a Database with Python

Project 3: Gathering LinkedIn Jobs Data with Crawlers and Scraping

Conclusion

I have spent a lot of time learning and building my homelab. I can say that this experience is not meant for every tech person. However, if you have the opportunity to give it a go, you will benefit a lot from learning what is under the abstractions run by cloud providers. Also, I can experiment with all kinds of services without fear. If I make a mistake, no problem: I just delete the instance and try again.

Feel free to contact me on my social media if you want to know more about my experience.