Data Engineers: I'm here for you ...

I know that some of you are struggling to find reliable projects to sharpen your skills in the big data world. There are a lot of concepts, tools, frameworks and algorithms, so it is better to stick to a relevant path and find the right resources. Don't worry, I've got you: just follow this tutorial. It is a good capstone project that I did at our company, Ancud IT.

Figure 1: Project architecture

As you can see above (Figure 1), I have highlighted the major steps of the project:

  • Data sources: extracting data from a website (structured and unstructured data)
  • ETL jobs: using one of the best workflow management frameworks, which is Apache Airflow
  • Sink: building a storage layer for your project on premise, in case you are not familiar with any of the cloud providers (GCP: Google, AWS: Amazon, Azure: Microsoft)

These days I see a lot of students, engineers and computer science enthusiasts buying a lot of online courses from Udacity and Udemy and ending up losing money and time. But don't worry, I'm here for you, to feed your knowledge with what I have learned.

In this article, we are going to break down how to build a storage layer for any data science project in an on-premise setup, which is much more difficult than doing it in any cloud provider. It is important to know about the cloud platforms, but it is better to master the basic skills on your local machine before implementing fancy projects on GCP, AWS or Azure.

IT advice: before running into big projects, try to put the small pieces together in smaller projects. What I mean is: don't get yourself involved in complex tasks without having a deep understanding of what's going on.

Every modern AI project needs a storage layer; it is where the gold is stored. So I'm going to walk you through how to build one, step by step.

Requirements:

  • Ubuntu 18.04 or 20.04

 

 

I. Set up the Kubernetes cluster:

1. Update the server :

sudo apt update
sudo apt -y upgrade

 

2. Install kubelet, kubeadm and kubectl:

sudo apt -y install curl apt-transport-https
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list

 

3. Then install required packages :

sudo apt update
sudo apt -y install vim git curl wget
sudo apt-get install kubectl=1.23.3-00 kubeadm=1.23.3-00 kubelet=1.23.3-00
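
A small optional addition that is not in the original walkthrough: to keep these pinned versions from being upgraded accidentally by a later apt upgrade, you can put the packages on hold.

# prevent apt from silently upgrading the pinned Kubernetes packages
sudo apt-mark hold kubelet kubeadm kubectl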

 

4. Check the version of kubectl & kubeadm :

kubectl version --client && kubeadm version
 

 

5. Disable swap and enable kernel modules:

sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
sudo swapoff -a
sudo modprobe overlay
sudo modprobe br_netfilter 
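
As an optional extra (following the usual kubeadm setup guides), you can also declare these modules in a modules-load.d file so they are loaded again automatically after a reboot:

# load the required kernel modules on every boot
sudo tee /etc/modules-load.d/k8s.conf <<EOF
overlay
br_netfilter
EOF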

 

6. Add settings to sysctl:

 

sudo tee /etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF

 

7. Reload sysctl :

sudo sysctl --system
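
To quickly confirm that the new values are active, you can read them back (each key should report a value of 1):

# verify the bridge and forwarding settings applied above
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward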

 

8. Install a container runtime for our cluster. We have three options here: Docker, CRI-O and containerd. We will use CRI-O:

sudo -i
OS=xUbuntu_20.04
VERSION=1.22

echo "deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/ /" > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list

echo "deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/$VERSION/$OS/ /" > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable:cri-o:$VERSION.list

curl -L https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable:cri-o:$VERSION/$OS/Release.key | apt-key add -

curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/Release.key | apt-key add -

 

9. Update our repositories :

sudo apt update
sudo apt install cri-o cri-o-runc

 

10. Display the installed version :

apt-cache policy cri-o

 

11. Enable CRI-O:

 

sudo systemctl daemon-reload
sudo systemctl enable crio --now

 

12. Confirm the runtime status:

sudo systemctl status crio

 

13. Initialize the master node:

We need to enable transparent masquerading to facilitate Virtual Extensible LAN (VXLAN) traffic for communication between Kubernetes pods across the cluster nodes. Confirm that the br_netfilter module is loaded:

lsmod | grep br_netfilter

 

14. Let us also enable the kubelet service to start on boot:

sudo systemctl enable kubelet

 

15. Pull the required images :

sudo kubeadm config images pull

 

16. Start the cluster:

sudo kubeadm init --pod-network-cidr=10.10.0.0/16 --cri-socket /var/run/crio/crio.sock
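
When kubeadm init finishes, it prints a few follow-up instructions; they typically boil down to copying the admin kubeconfig so that your regular user can run kubectl against the new cluster:

# make the cluster credentials available to your (non-root) user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config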

 

17. Install flannel :

kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
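
Before moving on, it is worth checking that the node reports Ready and that the flannel and system pods come up (pod names and namespaces can vary slightly between flannel versions):

# the node should reach the Ready state once flannel is running
kubectl get nodes -o wide
kubectl get pods --all-namespaces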

 

Now we are done with the Kubernetes cluster. Let's deploy MinIO, but before that, let's read this architecture together. Don't worry, I'm going to make things clear and simple:

Figure 2: MinIO architecture with Kubernetes
  • Kubernetes: the framework that automates the deployment process
  • MinIO: delivers scalable, secure, S3-compatible object storage
  • RACK 1..n: where the data gets stored, the physical hardware that provides the ability to store things
  • MinIO tenant: it is a tenant! It manages the interaction between Kubernetes and the app.
  • Application: it plays the role of the client; it can offer a bunch of features like uploading files or any other type of document.
  • Administrators: the app or the command-line tool used to interact with the tenants


 

II. Deploy MinIO on the Kubernetes cluster:

 

1. Install the krew plugin manager, which we will use to install the required kubectl plugins (see the sketch below)
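
The krew commands themselves are not listed in the original write-up, so here is a minimal sketch based on krew's documented install procedure; double-check it against the official krew documentation for your platform:

# download krew and install it into ~/.krew (per the krew documentation)
(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)

# make the krew-managed plugins visible to kubectl
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"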

 

2. Prepare the drives for MinIO with DirectPV:

  • MinIO uses Kubernetes PVs; MinIO's PVCs will consume these PVs.
  • The MinIO DirectPV driver automates this by discovering, formatting and mounting the cluster drives:
kubectl krew install directpv
kubectl directpv install
  • Display the drives attached to the cluster:
kubectl directpv drives ls

 

3. Select the drives to be formatted:

  • We have to choose ONLY the unformatted (raw) disks, because DirectPV will manage the formatting task.
  • This step is very important because it prepares the disks to be used by the PVs and PVCs with MinIO:
kubectl directpv drives format --drives /dev/sd{a...f} --nodes demo-worker-0{1...3}

 

4. Install MinIO Operator :

  • We will start by installing Helm, a tool for managing Kubernetes applications:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
sudo chmod 0700 get_helm.sh
./get_helm.sh
  • Download the operator values file:
curl -fsSL -o operator-values.yaml https://raw.githubusercontent.com/minio/operator/master/helm/operator/values.yaml
  • Download the tenant values file:
curl -fsSL -o tenant-values.yaml https://raw.githubusercontent.com/minio/operator/master/helm/tenant/values.yaml

Variables to set in tenant-values.yaml:

  • tenant.pools.servers: the number of pods that will host MinIO server instances. In our case it is 3.
  • tenant.pools.volumesPerServer: the number of devices per server, which is 6 in our case.
  • tenant.pools.storageClassName: this variable needs to be set to 'directpv-min-io'.
  • tenant.certificate.requestAutoCert: needs to be set to false in order to allow the MinIO client to connect to the object storage without certificates (see the illustrative excerpt after this list).
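
For orientation, the relevant part of tenant-values.yaml would then look roughly like the excerpt below. Treat it as an illustrative sketch built from the variable names above, since the exact layout of the chart's values file can differ between operator versions:

# illustrative excerpt of tenant-values.yaml (key names taken from the list above)
tenant:
  pools:
    - servers: 3                          # number of MinIO server pods
      volumesPerServer: 6                 # drives per server
      storageClassName: directpv-min-io   # storage class provided by DirectPV
  certificate:
    requestAutoCert: false                # let clients connect without TLS certificates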

 

5. Create the MinIO Operator and the tenant:

helm repo add minio https://operator.min.io/

helm upgrade --install --namespace minio-operator --create-namespace minio-operator minio/operator --values operator-values.yaml

helm upgrade --install --namespace minio-tenant-1 --create-namespace minio-tenant-1 minio/tenant --values tenant-values.yaml

 

6. Check the pods in ‘minio-operator’ & ‘minio-tenant-1’ namespaces :

kubectl get pods -n minio-operator
kubectl get pods -n minio-tenant-1

 

7. Get temporary access to the Operator console through the following command:

kubectl port-forward --address 0.0.0.0 -n minio-operator service/console 9090:9090
  • You can now reach the MinIO Operator console in your browser on port 9090.
  • To get the login token:

kubectl -n minio-operator get secret console-sa-secret -o jsonpath="{.data.token}" | base64 --decode
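
From here you can already exercise the tenant as an object store with the MinIO client (mc). The sketch below is only an illustration: the service name, port, placeholder credentials and bucket name are assumptions, so verify them with kubectl get svc -n minio-tenant-1 and your tenant configuration first:

# inspect the tenant's services; the name and port used below are assumptions to verify
kubectl get svc -n minio-tenant-1

# forward the tenant's S3 endpoint (assumed here to be service/minio on port 80) to the local machine
kubectl port-forward -n minio-tenant-1 service/minio 9000:80 &

# point mc at the forwarded endpoint and create a first bucket (replace the placeholder credentials)
mc alias set myminio http://localhost:9000 <ACCESS_KEY> <SECRET_KEY>
mc mb myminio/raw-data
mc cp ./example.csv myminio/raw-data/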

Setting up the storage layer was the first step in this project. In the upcoming articles I'm going to break down one of the most important workflow management tools in the data engineering field, Apache Airflow, and I'm also going to explain how to monitor data pipelines using Prometheus and Grafana.

 

Conclusion :

At the end, dear reader, I would like to thank you for reading my first article, and I hope it was informative for you. Stick around, because the best is yet to come.