Blog - Introduction

Ancud Blog

Welcome to the Ancud Blog. Here you will find a wide range of interesting articles on various topics. Dive into our world of knowledge!

Blogs

Data Engineers: I’m here for you ...

I know that some of you are struggling to find reliable projects to sharpen your skills in the big data science world: there are a lot of concepts, tools, frameworks and algorithms, so it is better to stick to a relevant path and find the right resources. Don’t worry, I’ve got you. Just follow this tutorial; it is a good capstone project that I did inside our company, Ancud IT.

Figure 1: Project architecture

As you can see above (Figure 1), I have highlighted the major steps of the project:

  • Data sources: extracting data from a website (structured and unstructured data)
  • ETL jobs: using one of the best workflow management frameworks, which is Apache Airflow
  • Sink: building a storage layer for your project, even if you are not familiar with any of the cloud providers (GCP: Google, AWS: Amazon, Azure: Microsoft)

These days I see a lot of students, engineers and computer science enthusiasts buying a lot of online courses from Udacity or Udemy and ending up losing money and time. But don’t worry, I’m here for you, to feed your knowledge with what I have learned.

In this article, we are going to break down how to build a storage layer for any data science project in an on-premise setup, which is much more difficult than doing it with a cloud provider. It is important to know about the cloud platforms, but it is better to master the basic skills on your local machine before implementing fancy projects in GCP, AWS or Azure.

IT advice: before running into big projects, try to put the small pieces together in smaller projects first. What I mean is: don’t get yourself involved in complex tasks without having a deep understanding of what’s going on.

Every modern AI project needs a storage layer; it is where the gold is stored. So I’m going to walk you through how to build one, step by step.

Requirements:

  • Ubuntu 18.04 or 20.04

 

 

I. Set up the Kubernetes cluster:

1. Update the server:

sudo apt update
sudo apt -y upgrade

 

2. Install kubelet, kubeadm and kubectl:

sudo apt -y install curl apt-transport-https
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list

 

3. Then install the required packages:

sudo apt update
sudo apt -y install vim git curl wget
sudo apt-get install kubectl=1.23.3-00 kubeadm=1.23.3-00 kubelet=1.23.3-00
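
Optionally, to keep apt from upgrading these pinned versions later, you can put them on hold (a common companion step, not strictly required):

sudo apt-mark hold kubelet kubeadm kubectl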

 

4. Check the versions of kubectl and kubeadm:

kubectl version --client && kubeadm version
 

 

5. Disable swap and enable kernel modules:

sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
sudo swapoff -a
sudo modprobe overlay
sudo modprobe br_netfilter 
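
To make these kernel modules persist across reboots, you can also declare them in a modules-load.d file (an optional but common companion step):

sudo tee /etc/modules-load.d/k8s.conf<<EOF
overlay
br_netfilter
EOF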

 

6. Add settings to sysctl:

 

sudo tee /etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF

 

7. Reload sysctl:

sudo sysctl --system

 

8. Install a container runtime for our cluster. We have three options here: Docker, CRI-O and containerd. We will go with CRI-O:

sudo -i
OS=xUbuntu_20.04
VERSION=1.22

echo "deb
https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/ /" >
/etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list

echo "deb
http://download.opensuse.org/repositories/devel:/kubic:/libcontainer
s:/stable:/cri-o:/$VERSION/$OS/ /" >
/etc/apt/sources.list.d/devel:kubic:libcontainers:stable:cri-
o:$VERSION.list

curl -L
https://download.opensuse.org/repositories/devel:kubic:libcontainers:
stable:cri-o:$VERSION/$OS/Release.key | apt-key add -

curl -L
https://download.opensuse.org/repositories/devel:/kubic:/libcontainer
s:/stable/$OS/Release.key | apt-key add -

 

9. Update our repositories:

sudo apt update
sudo apt install cri-o cri-o-runc

 

10. Display the installed version:

apt-cache policy cri-o

 

11. Enable CRI-O:

 

sudo systemctl daemon-reload
sudo systemctl enable crio --now

 

12. Confirm the runtime status:

sudo systemctl status crio

 

13. Initialize the master node:

We need to enable transparent masquerading to facilitate Virtual Extensible LAN (VxLAN) traffic for communication between Kubernetes pods across the cluster nodes. First, confirm that the br_netfilter module is loaded:

lsmod | grep br_netfilter

 

14. Let us also enable the kubelet service to start on boot:

sudo systemctl enable kubelet

 

15. Pull the required images:

sudo kubeadm config images pull

 

16. Start the cluster:

sudo kubeadm init --pod-network-cidr=10.10.0.0/16 --cri-socket /var/run/crio/crio.sock
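
Once kubeadm init completes, it prints follow-up instructions. To run kubectl as a regular user, the usual step from that output is to copy the admin kubeconfig (a minimal sketch; adjust if your setup differs):

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config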

 

17. Install flannel:

kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
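
Note that the flannel manifest ships with 10.244.0.0/16 as its default pod network in net-conf.json, so if you keep the 10.10.0.0/16 CIDR from step 16 you may need to adjust the manifest accordingly. Either way, a quick sanity check that the cluster is healthy looks like this:

# the control-plane node should report Ready once the CNI is up
kubectl get nodes
# all system and flannel pods should reach the Running state
kubectl get pods --all-namespaces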

 

Now we are done with the Kubernetes cluster, so let’s deploy MinIO. But before that, let’s read this architecture together. Don’t worry, I’m going to make things clear and simple:

Figure 2: MinIO architecture with Kubernetes

  • Kubernetes: the framework that automates the deployment process
  • MinIO: delivers scalable, secure, S3-compatible object storage
  • RACK 1..n: where the data gets stored; the physical hardware that provides the storage capacity
  • MinIO tenant: manages the interaction between Kubernetes and the application
  • Application: plays the role of the client; it can offer features like uploading files or any other type of document
  • Administrators: the application or the command-line tool used to interact with the tenants (see the sketch below)
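
To make the Application and Administrator roles concrete, here is a minimal sketch of how a client could later talk to the object store with MinIO's mc command-line tool. The alias name, endpoint, credentials and bucket name are placeholders, not values from this setup:

# register the tenant endpoint under a local alias (placeholders)
mc alias set myminio http://<tenant-endpoint>:<port> <ACCESS_KEY> <SECRET_KEY>
# create a bucket and upload a file into it
mc mb myminio/raw-data
mc cp ./sample.csv myminio/raw-data/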


 

II. Deploy MinIO on the Kubernetes cluster:

 

1. Install the krew plugin manager, which we will use to install the required kubectl plugins:
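
A minimal sketch of the krew installation, following the krew documentation (the download URL and PATH export are krew's standard procedure; adjust the architecture detection if your platform differs):

(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)

# make krew-managed plugins visible to kubectl
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"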

 

2. Prepare the drives for MinIO with DirectPV:

  • MinIO uses Kubernetes persistent volumes (PVs); MinIO's persistent volume claims (PVCs) will bind to these PVs.
  • The MinIO DirectPV driver automates this by discovering, formatting and mounting the cluster drives:
kubectl krew install directpv
kubectl directpv install
  • Display the drives attached to the cluster:
kubectl directpv drives ls

 

3. Select the drives to be formatted:

  • We have to choose ONLY the unformatted disks (raw disks), because DirectPV will manage the formatting task.
  • This step is very important because it prepares the disks to be used by the PVs and PVCs with MinIO.
kubectl directpv drives format --drives /dev/sd{a...f} --nodes demo-worker-0{1...3}

 

4. Install the MinIO Operator:

  • We will start by installing Helm, a package manager for Kubernetes applications:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
sudo chmod 0700 get_helm.sh
./get_helm.sh
  • Download the operator values file:
curl -fsSL -o operator-values.yaml https://raw.githubusercontent.com/minio/operator/master/helm/operator/values.yaml
  • Download the tenant values file:
curl -fsSL -o tenant-values.yaml https://raw.githubusercontent.com/minio/operator/master/helm/tenant/values.yaml

Variables to set in tenant-values.yaml (see the sketch after this list for one way to apply them):

  • tenant.pools.servers: the number of pods that will host MinIO server instances; in our case it is 3.
  • tenant.pools.volumesPerServer: the number of devices per server, which is 6 in our case.
  • tenant.pools.storageClassName: this variable needs to be set to ‘directpv-min-io’.
  • tenant.certificate.requestAutoCert: needs to be set to false so that MinIO clients can connect to the object storage without TLS certificates.
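
If you prefer not to edit tenant-values.yaml by hand, the same values can also be passed as overrides on the helm command line. This is only a sketch, assuming the chart exposes the keys above and that tenant.pools is a list (so the first pool is addressed as pools[0]); double-check the keys against your chart version. It is the same tenant install command used in step 5, just with explicit overrides:

helm upgrade --install --namespace minio-tenant-1 --create-namespace \
  minio-tenant-1 minio/tenant --values tenant-values.yaml \
  --set 'tenant.pools[0].servers=3' \
  --set 'tenant.pools[0].volumesPerServer=6' \
  --set 'tenant.pools[0].storageClassName=directpv-min-io' \
  --set 'tenant.certificate.requestAutoCert=false'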

 

5. Create the MinIO operator and the tenant:

helm repo add minio https://operator.min.io/

helm upgrade --install --namespace minio-operator --create-namespace minio-operator minio/operator --values operator-values.yaml

helm upgrade --install --namespace minio-tenant-1 --create-namespace minio-tenant-1 minio/tenant --values tenant-values.yaml

 

6. Check the pods in the ‘minio-operator’ and ‘minio-tenant-1’ namespaces:

kubectl get pods -n minio-operator
kubectl get pods -n minio-tenant-1
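
If you prefer not to poll these commands by hand, kubectl wait can block until the tenant pods are ready; the 5-minute timeout below is an arbitrary choice:

# wait until every pod in the tenant namespace reports Ready (up to 5 minutes)
kubectl -n minio-tenant-1 wait --for=condition=Ready pod --all --timeout=300s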

 

7. Get temporary access through the following command:

kubectl port-forward --address 0.0.0.0 -n minio-operator service/console 9090:9090

  • You can then access the MinIO Operator console in your browser through the forwarded port (9090).
  • To get the login token:
kubectl -n minio-operator get secret console-sa-secret -o jsonpath="{.data.token}" | base64 --decode
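
To reach the object storage itself rather than the operator console, you can port-forward the tenant's MinIO service in the same way. The service name and port below are assumptions, so list the services first and adjust to what the listing actually shows:

# discover the services created for the tenant
kubectl -n minio-tenant-1 get svc
# forward the S3 endpoint locally (replace 'minio' and 80 with the real service name and port)
kubectl -n minio-tenant-1 port-forward service/minio 9000:80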

Setting up the storage layer was the first step in this project. In the upcoming articles I’m going to break down one of the most important workflow management tools in the data engineering field, Apache Airflow, and I’m also going to explain how to monitor data pipelines using Prometheus and Grafana.

 

Conclusion:

At the end, dear reader, I would like to thank you for reading my first article and I hope it was informative for you, stick around because the best is yet to come.