I know that some of you are struggling to find reliable projects to sharpen your skills in the big data science world: there are a lot of concepts, tools, frameworks and algorithms, so it is better to stick to a relevant path and find the right resources. Don't worry, I've got you. Just follow this tutorial; it is a good capstone project that I did inside our company, Ancud IT.
Figure 1: Project architecture
As you can see above (Figure 1), I have highlighted the major steps of the project:
These days I see a lot of students, engineers and computer science enthusiasts buying plenty of online courses from Udacity, Udemy and the like, and ending up losing money and time. But don't worry, I'm here to feed your knowledge with what I have learned.
In this article, we are going to break down how to build a storage layer for any data science project in an on-premise setup, which is considerably harder than doing it with a cloud provider. It is important to know about the cloud offerings, but it is better to master the basic skills on your local machines before implementing fancy projects on GCP, AWS or Azure.
IT advice: before running into big projects, practise the small pieces in smaller projects. What I mean is, don't get yourself involved in complex tasks without having a deep understanding of what's going on.
Every modern AI project needs a storage layer; it is where the gold is stored. So I'm going to walk you through how to build one, step by step.
Requirements:
1. Update the server:
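As a sketch, assuming Ubuntu/Debian nodes (adapt the package manager to your distribution), run this on every server:

    # refresh the package index and apply pending upgrades
    sudo apt update
    sudo apt -y full-upgrade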
2. Install kubelet, kubeadm and kubectl:
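The first part of this step is adding the Kubernetes package repository. The sketch below uses the community-owned pkgs.k8s.io repository with v1.28 as an example minor version; adjust the version path to the release you actually want:

    # helper packages needed to add and trust the repository
    sudo apt update
    sudo apt install -y curl gnupg apt-transport-https ca-certificates

    # add the Kubernetes apt repository key and source (v1.28 is only an example)
    sudo mkdir -p /etc/apt/keyrings
    curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key \
        | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /" \
        | sudo tee /etc/apt/sources.list.d/kubernetes.list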
3. Then install the required packages:
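With the repository in place, install the tools and pin them so an unattended upgrade does not silently move the cluster to a new version:

    sudo apt update
    sudo apt install -y kubelet kubeadm kubectl
    sudo apt-mark hold kubelet kubeadm kubectl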
4. Check the versions of kubectl & kubeadm:
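A quick sanity check that both binaries are on the PATH:

    kubectl version --client
    kubeadm version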
5. Disable swap and enable kernel modules:
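The kubelet refuses to start while swap is on, and the overlay and br_netfilter modules are needed by the container runtime and pod networking. A sketch:

    # turn swap off now and keep it off after a reboot
    sudo swapoff -a
    sudo sed -i '/ swap / s/^/#/' /etc/fstab

    # load the required kernel modules and make them persistent
    sudo modprobe overlay
    sudo modprobe br_netfilter
    printf 'overlay\nbr_netfilter\n' | sudo tee /etc/modules-load.d/k8s.conf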
6. Add settings to sysctl:
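These settings let iptables see bridged traffic and turn on IP forwarding; the file name under /etc/sysctl.d/ is just a convention:

    # persist the networking settings Kubernetes needs
    printf '%s\n' \
        'net.bridge.bridge-nf-call-ip6tables = 1' \
        'net.bridge.bridge-nf-call-iptables = 1' \
        'net.ipv4.ip_forward = 1' \
        | sudo tee /etc/sysctl.d/kubernetes.conf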
7. Reload sysctl:
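Reload all sysctl configuration files so the new values take effect without a reboot:

    sudo sysctl --system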
8. Install a container runtime for our cluster. We have three options here: Docker, CRI-O and containerd. The following steps use CRI-O:
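Since the later steps enable and check the crio service, the sketch below adds a CRI-O repository. It assumes the CRI-O packages published under pkgs.k8s.io; the version path (v1.28 here) should match your Kubernetes minor version:

    # add the CRI-O apt repository (v1.28 is only an example; match your cluster version)
    curl -fsSL https://pkgs.k8s.io/addons:/cri-o/stable:/v1.28/deb/Release.key \
        | sudo gpg --dearmor -o /etc/apt/keyrings/cri-o-apt-keyring.gpg
    echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o/stable:/v1.28/deb/ /" \
        | sudo tee /etc/apt/sources.list.d/cri-o.list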
9. Update our repositories:
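Refresh the package index so the new CRI-O repository is picked up, then install the runtime:

    sudo apt update
    sudo apt install -y cri-o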
10. Display the installed version:
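One way to check which CRI-O version apt installed:

    apt-cache policy cri-o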
11. Enable CRI-O:
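Enable the service so it starts now and on every boot:

    sudo systemctl daemon-reload
    sudo systemctl enable --now crio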
12. Confirm the runtime status:
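The service should report active (running):

    sudo systemctl status crio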
13. Initialize the master node:
We need to enable transparent masquerading to facilitate Virtual Extensible LAN (VxLAN) traffic for communication between Kubernetes pods across the cluster nodes.
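If you followed step 5, the br_netfilter module is already loaded; the check below confirms it. The firewalld commands are only relevant if your nodes actually run firewalld:

    # confirm the br_netfilter module is loaded
    lsmod | grep br_netfilter

    # on nodes running firewalld, also allow masquerading
    sudo firewall-cmd --add-masquerade --permanent
    sudo firewall-cmd --reload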
14. Let us also enable the kubelet service to start on boot:
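It is normal for the kubelet to restart in a loop until the cluster is initialized in step 16:

    sudo systemctl enable --now kubelet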
15. Pull the required images:
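Pre-pulling the control-plane images makes the init step faster and surfaces registry problems early:

    sudo kubeadm config images pull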
16. Start the cluster:
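A sketch of the init call; 10.244.0.0/16 is the pod CIDR that Flannel expects by default, and the kubeconfig copy afterwards lets your regular user talk to the cluster:

    sudo kubeadm init --pod-network-cidr=10.244.0.0/16

    # make kubectl work for the current user
    mkdir -p $HOME/.kube
    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config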
17. Install Flannel:
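Flannel is applied from the manifest published in the project's repository; on a single-node setup you may also want to allow workloads on the control-plane node:

    kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

    # single-node only: remove the control-plane taint so pods can be scheduled here
    kubectl taint nodes --all node-role.kubernetes.io/control-plane-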
Now we are done with the Kubernetes cluster, so let's deploy MinIO. But before that, let's read through this architecture together; don't worry, I'm going to make things clear and simple:
1. Install the Krew plugin manager, which we will use to install the required kubectl plugins:
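The snippet below is essentially the installer documented by the Krew project; it assumes bash, git and curl are available:

    (
      set -x; cd "$(mktemp -d)" &&
      OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
      ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
      KREW="krew-${OS}_${ARCH}" &&
      curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
      tar zxvf "${KREW}.tar.gz" &&
      ./"${KREW}" install krew
    )

    # put krew on the PATH (add this line to ~/.bashrc to make it permanent)
    export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"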
2. Prepare the drives for MinIO with DirectPV:
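DirectPV is installed as a kubectl plugin through Krew. The exact subcommands differ a little between releases, so treat this as a sketch for a recent (v4.x) DirectPV:

    # install the DirectPV plugin and deploy its CSI driver in the cluster
    kubectl krew install directpv
    kubectl directpv install

    # discover the drives attached to the nodes; this writes a drives.yaml file for review
    kubectl directpv discover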
3. Select the drives to be formatted:
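Formatting is destructive, so edit drives.yaml first and keep only the drives you really want DirectPV to take over; recent releases also require an explicit confirmation flag:

    # initialize (format) the drives listed in drives.yaml for DirectPV
    kubectl directpv init drives.yaml --dangerous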
4. Install the MinIO Operator:
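The Operator can also be installed through a kubectl plugin from Krew; the sketch below assumes that route rather than Helm:

    # install the MinIO kubectl plugin and deploy the operator into the cluster
    kubectl krew install minio
    kubectl minio init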
Variables to set:
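The exact values depend on your sizing; the names below are only placeholders that I reuse in the tenant command of the next step (directpv-min-io is the storage class that DirectPV creates):

    # placeholder values - adjust them to your own cluster
    TENANT_NAME=minio-tenant-1
    TENANT_NS=minio-tenant-1
    STORAGE_CLASS=directpv-min-io
    SERVERS=4        # number of MinIO server pods
    VOLUMES=16       # total number of volumes across the tenant
    CAPACITY=16Ti    # total raw capacity of the tenant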
5. Create the MinIO Operator and the tenant:
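Using the placeholder variables above, a tenant can be created like this; check kubectl minio tenant create --help for the flags supported by your plugin version:

    kubectl create namespace "$TENANT_NS"
    kubectl minio tenant create "$TENANT_NAME" \
        --servers "$SERVERS" \
        --volumes "$VOLUMES" \
        --capacity "$CAPACITY" \
        --namespace "$TENANT_NS" \
        --storage-class "$STORAGE_CLASS"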
6. Check the pods in the 'minio-operator' & 'minio-tenant-1' namespaces:
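Everything should end up in the Running state:

    kubectl get pods -n minio-operator
    kubectl get pods -n minio-tenant-1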
7. Get temporary access through the following command:
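One way to get temporary access is to proxy the Operator console to your machine; the command prints a JWT token that you paste into the login page:

    # port-forward the MinIO Operator console to localhost and print a login token
    kubectl minio proxy -n minio-operator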
Setting up the storage layer was the first step in this project. In the upcoming articles I'm going to break down one of the most important workflow management tools in the data engineering field, Apache Airflow, and I'm also going to explain how to monitor data pipelines using Prometheus and Grafana.
At the end, dear reader, I would like to thank you for reading my first article. I hope it was informative for you; stick around, because the best is yet to come.