Deploy Kafka Cluster with Zookeeper on GCP — The hard way

RK Kuppala · Published in The Cloudside View · 7 min read · Jan 1, 2022


If you accidentally landed on this page looking for the original (Franz) Kafka, however unlikely it may seem, I have nothing to offer to you, except this wonderful quote. 😄

By believing passionately in something that still does not exist, we create it. The nonexistent is whatever we have not sufficiently desired

— Franz Kafka

But if you came here for Apache Kafka, I assume you already know what Kafka is. For the uninitiated, here is a fun illustrated introduction to Kafka, aptly named Gently Down The Stream. But don’t stop there, read a more serious introduction here.

With that out of the way, let's get straight to the point of this post. We recently had the opportunity of deploying self-managed Kafka clusters at scale for multiple customers as part of migrating them from on-premises/AWS environments to Google Cloud Platform. This post is a how-to guide on deploying a highly available Kafka (version 3.0.0) cluster with a Zookeeper ensemble on Compute Engine.

Kafka and Its Keeper

We are going to deploy this Kafka cluster along with a Zookeeper ensemble. I am sure some of you are wondering why Kafka still needs a keeper. Zookeeper is a well-known open-source server that is used to manage the Kafka cluster's metadata, among other things. To avoid external metadata management and simplify Kafka clusters, future Kafka releases will not need an external Zookeeper; instead, Kafka will manage the metadata quorum internally using KRaft (Kafka Raft metadata mode). Unfortunately, KRaft is not production-ready yet, so for this post we will go with Zookeeper.

What will we build

[Architecture diagram (made with Excalidraw): a 3-node Zookeeper ensemble and 3 Kafka brokers spread across three asia-south1 zones]

We will deploy a highly available Kafka cluster in the Mumbai (asia-south1) region on GCP. We will configure a 3-node Zookeeper ensemble (or cluster) spread across 3 availability zones and configure 3 Kafka brokers (you can configure more, depending on your throughput requirements).

Let’s get started!

VPC, Subnets, and firewall rules

If you already have a VPC and subnets, feel free to skip this section. If you don’t have one already, follow along.

Configure your project and region in gcloud

gcloud auth login 
gcloud config set project YOUR-PROJECT-NAME
gcloud config set compute/region asia-south1

For the sake of this guide, let's create a simple VPC and one subnet. In production, you would ideally provision these in a private subnet dedicated to the database layer.
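Something along these lines should do; the names kafka-vpc and kafka-subnet and the 10.10.0.0/24 range are placeholders (the range matches the broker IPs used later in this guide):

# Custom-mode VPC
gcloud compute networks create kafka-vpc --subnet-mode=custom

# One subnet in asia-south1
gcloud compute networks subnets create kafka-subnet \
  --network=kafka-vpc \
  --region=asia-south1 \
  --range=10.10.0.0/24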

We will use the network tag kafka to enable access between the Kafka and Zookeeper nodes. Let's create a firewall rule. The one below is rather permissive; you might want to open only the necessary ports in production.
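Something like the following should work; the rule names are placeholders, and the second rule opens SSH from Google's IAP range (35.235.240.0/20) since we will reach the VMs through IAP tunnels:

# Allow all traffic between VMs tagged "kafka"
# (tighten to 2181/2888/3888 for Zookeeper and 9092 for Kafka in production)
gcloud compute firewall-rules create kafka-internal \
  --network=kafka-vpc \
  --allow=tcp,udp,icmp \
  --source-tags=kafka \
  --target-tags=kafka

# Allow SSH over IAP tunnels
gcloud compute firewall-rules create allow-iap-ssh \
  --network=kafka-vpc \
  --allow=tcp:22 \
  --source-ranges=35.235.240.0/20 \
  --target-tags=kafka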

Configure Zookeeper

Let’s provision a VM and configure Zookeeper on it. While launching these VMs, we will assign the network tag kafka so that the firewall rule we created above comes into effect and allows traffic between these VMs.
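Here is a sketch for the first Zookeeper VM; the machine type, image, and subnet name are placeholders, so pick what suits your environment:

gcloud compute instances create zk-01 \
  --zone=asia-south1-a \
  --machine-type=e2-small \
  --image-family=debian-11 \
  --image-project=debian-cloud \
  --subnet=kafka-subnet \
  --no-address \
  --tags=kafka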

Verify if the VM is up and running

gcloud compute instances list --filter="tags.items=kafka"

SSH to this VM. Install and configure Zookeeper.

gcloud beta compute ssh zk-01  \
--tunnel-through-iap \
--zone=asia-south1-a

Install JDK. After installing the JDK, it’s important to set the Java heap size in production environments. This will help avoid swapping, which degrades Zookeeper’s performance. Refer to the Zookeeper administration guide here.

sudo apt update 
sudo apt install default-jdk -y
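One way to cap the heap, once Zookeeper is unpacked under /opt/zookeeper in the next step, is a conf/java.env file, which zkEnv.sh sources on startup; the 1 GB figure here is just an example, size it for your instance:

# /opt/zookeeper/conf/java.env -- sourced by bin/zkEnv.sh
export SERVER_JVMFLAGS="-Xms1g -Xmx1g"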

Add a user to run the Zookeeper service and grant it the relevant permissions; we will go with zkadmin.
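Roughly like this; the Zookeeper version (3.7.0) is just the one assumed here, grab whichever release you prefer:

# Create the zkadmin user and the data directory referenced in zoo.cfg
sudo useradd -m -s /bin/bash zkadmin
sudo mkdir -p /data/zookeeper

# Download and unpack Zookeeper into /opt/zookeeper
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.7.0/apache-zookeeper-3.7.0-bin.tar.gz
sudo tar -xzf apache-zookeeper-3.7.0-bin.tar.gz -C /opt
sudo mv /opt/apache-zookeeper-3.7.0-bin /opt/zookeeper

# Give zkadmin ownership of the install and data directories
sudo chown -R zkadmin:zkadmin /opt/zookeeper /data/zookeeper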

Time to configure Zookeeper. Let’s make a config file.

vi /opt/zookeeper/conf/zoo.cfg

Add the following configuration.

tickTime=2000
dataDir=/data/zookeeper
clientPort=2181
initLimit=10
syncLimit=5

We can quickly verify whether everything went well so far by starting Zookeeper

cd /opt/zookeeper && /opt/zookeeper/bin/zkServer.sh start

You can connect and verify

bin/zkCli.sh -server 127.0.0.1:2181

You should see something like this

WATCHER::WatchedEvent state:SyncConnected type:None path:null
[zk: 127.0.0.1:2181(CONNECTED) 0]

Let’s stop Zookeeper and proceed with the rest of the setup

sudo bin/zkServer.sh stop

Let’s create a systemd unit file so that we can manage Zookeeper as a service — sudo vi /etc/systemd/system/zookeeper.service

Add the following
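A unit file along these lines should work; it assumes the zkadmin user and the /opt/zookeeper layout from earlier:

[Unit]
Description=Apache Zookeeper
After=network.target

[Service]
Type=forking
User=zkadmin
Group=zkadmin
WorkingDirectory=/opt/zookeeper
ExecStart=/opt/zookeeper/bin/zkServer.sh start /opt/zookeeper/conf/zoo.cfg
ExecStop=/opt/zookeeper/bin/zkServer.sh stop /opt/zookeeper/conf/zoo.cfg
ExecReload=/opt/zookeeper/bin/zkServer.sh restart /opt/zookeeper/conf/zoo.cfg
Restart=on-failure

[Install]
WantedBy=multi-user.target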

Now that we have tested Zookeeper on one instance, we can create an image out of it and use it to spin up 2 more nodes. Let’s make an image.

gcloud beta compute machine-images \
create zookeeper-machine-image \
--source-instance zk-01 \
--source-instance-zone asia-south1-a

We can now use this image to create two more VMs — zk-02 and zk-03 in the asia-south1-b and asia-south1-c zones respectively. Let's create these VMs.
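Something like this; the machine image should carry over the subnet and the kafka network tag from zk-01:

gcloud beta compute instances create zk-02 \
  --zone=asia-south1-b \
  --source-machine-image=zookeeper-machine-image

gcloud beta compute instances create zk-03 \
  --zone=asia-south1-c \
  --source-machine-image=zookeeper-machine-image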

Once all the Zookeeper nodes are up and running, we will create a myid file on each of the Zookeeper nodes. The myid file consists of a single line containing only the text of that machine's id, so the myid of server 1 would contain the text "1" and nothing else. The id must be unique within the Zookeeper ensemble and should have a value between 1 and 255.
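For example, run the matching command on each node:

# On zk-01
echo "1" | sudo tee /data/zookeeper/myid

# On zk-02
echo "2" | sudo tee /data/zookeeper/myid

# On zk-03
echo "3" | sudo tee /data/zookeeper/myid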

Let’s now edit the Zookeeper configuration file on all nodes to form the cluster quorum. We will add the hostname and ports of each node in the configuration file. You can find the FQDN of your Zookeeper nodes by running the command hostname -A; usually it will be in the following format

[SERVER-NAME].[ZONE].[PROJECT-NAME].internal

Example: zk-01.asia-south1-a.c.cloudside-academy.internal. If your Zookeeper cluster is a single-zone deployment, you can simply resolve using the VM name, e.g. ping zk-03. Since our deployment is multi-zone and you cannot resolve names across zones without an FQDN, let's use FQDNs. If you want to keep things simpler, you may add the static internal IP of each node instead of the domain name.

SSH to each Zookeeper node. On each node, switch to the zkadmin user with su -l zkadmin and open /opt/zookeeper/conf/zoo.cfg. The config file on all nodes should look like this:
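Here is a sketch, reusing the example FQDNs from above (substitute your own project and zones):

tickTime=2000
dataDir=/data/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
4lw.commands.whitelist=*
server.1=zk-01.asia-south1-a.c.cloudside-academy.internal:2888:3888
server.2=zk-02.asia-south1-b.c.cloudside-academy.internal:2888:3888
server.3=zk-03.asia-south1-c.c.cloudside-academy.internal:2888:3888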

We have added the 4lw.commands.whitelist=* config option to whitelist commands like stat, ruok, conf, and isro.

Start Zookeeper Cluster

We are now ready to start the cluster. Let’s enable zookeeper.service on all nodes and start it.
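On each node:

sudo systemctl daemon-reload
sudo systemctl enable zookeeper
sudo systemctl start zookeeper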

You may now connect to one of the Zookeeper nodes and check the cluster status.
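For example, using zkServer.sh (or the stat four-letter command we whitelisted):

/opt/zookeeper/bin/zkServer.sh status

# or, if netcat is installed
echo stat | nc localhost 2181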

You will see the status as either leader or follower for each node.

Setting Up Kafka

The Zookeeper ensemble is up and running, so let’s move on to configuring the Kafka brokers. We will configure the broker on one VM and, from its image, spin up two more Kafka brokers.

You might want to choose the right instance type in production and also attach an SSD disk for storage. For this guide, we will go ahead with an e2-small VM. Let’s reserve three internal static IPs for our Kafka brokers and launch the first broker VM.
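Roughly like this; the address names, the 10.10.0.20 IP for the first broker, and the subnet name are placeholders (10.10.0.21 and 10.10.0.22 match the configs used later in this guide):

# Reserve three static internal IPs
gcloud compute addresses create kafka-broker-ip-01 kafka-broker-ip-02 kafka-broker-ip-03 \
  --region=asia-south1 \
  --subnet=kafka-subnet \
  --addresses=10.10.0.20,10.10.0.21,10.10.0.22

# First broker VM, pinned to the first reserved IP
gcloud compute instances create kafka-broker-01 \
  --zone=asia-south1-a \
  --machine-type=e2-small \
  --image-family=debian-11 \
  --image-project=debian-cloud \
  --subnet=kafka-subnet \
  --private-network-ip=10.10.0.20 \
  --no-address \
  --tags=kafka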

SSH to the VM and start setting up Kafka.

gcloud beta compute ssh kafka-broker-01 \
--tunnel-through-iap \
--zone=asia-south1-a

Create a user for running Kafka services and download the binaries.
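Along these lines; the kafkaadmin user name is my choice, and the Scala 2.13 build of Kafka 3.0.0 is an assumption:

# User to run the Kafka service
sudo useradd -m -s /bin/bash kafkaadmin

# Kafka needs a JDK on this VM as well
sudo apt update && sudo apt install default-jdk -y

# Download Kafka 3.0.0 and unpack it to /data/kafka
wget https://archive.apache.org/dist/kafka/3.0.0/kafka_2.13-3.0.0.tgz
sudo mkdir -p /data
sudo tar -xzf kafka_2.13-3.0.0.tgz -C /data
sudo mv /data/kafka_2.13-3.0.0 /data/kafka
sudo chown -R kafkaadmin:kafkaadmin /data/kafka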

You can use the following configuration file to begin with.

For this first Kafka broker, we will assign broker.id as 0; you can add more brokers and increment the id. Use the internal IP of the broker VM to configure advertised.listeners and listeners. Finally, configure zookeeper.connect to include all Zookeeper nodes (use either FQDNs or IP addresses).
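Here is a sketch of /data/kafka/config/server.properties for the first broker; the 10.10.0.20 IP, the FQDNs, and the replication settings are assumptions you should adapt:

# Broker identity and listeners
broker.id=0
listeners=PLAINTEXT://10.10.0.20:9092
advertised.listeners=PLAINTEXT://10.10.0.20:9092

# Data directory (cleaned up later when cloning broker VMs)
log.dirs=/data/kafka/logs
num.partitions=3

# Durability defaults for a 3-broker cluster
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
log.retention.hours=168

# Zookeeper ensemble
zookeeper.connect=zk-01.asia-south1-a.c.cloudside-academy.internal:2181,zk-02.asia-south1-b.c.cloudside-academy.internal:2181,zk-03.asia-south1-c.c.cloudside-academy.internal:2181
zookeeper.connection.timeout.ms=18000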

Move the existing sample config file and create a new config with the contents from above.
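For example:

cd /data/kafka
sudo mv config/server.properties config/server.properties.orig
sudo vi config/server.properties   # paste the config from above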

Create a Kafka Service

Let’s create a systemd file — sudo vi /etc/systemd/system/kafka.service

Add the following:
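A sketch, assuming the kafkaadmin user and the /data/kafka install path from earlier:

[Unit]
Description=Apache Kafka
Requires=network.target
After=network.target

[Service]
Type=simple
User=kafkaadmin
Group=kafkaadmin
ExecStart=/data/kafka/bin/kafka-server-start.sh /data/kafka/config/server.properties
ExecStop=/data/kafka/bin/kafka-server-stop.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target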

Start the service — sudo systemctl start kafka

You can now connect to one of the Zookeeper nodes and verify that the kafka broker with id 0 shows up.

# SSH to any Zookeeper node
cd /opt/zookeeper
bin/zkCli.sh -server 127.0.0.1:2181

After connecting, run

ls /brokers/ids

You should see this
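If the broker registered itself, the output should list broker id 0, something like:

[zk: 127.0.0.1:2181(CONNECTED) 0] ls /brokers/ids
[0]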

Create an Image

We are now ready to make an image out of this VM and use it to spin up more Kafka broker nodes. Let’s make an image.

gcloud beta compute machine-images \
create kafka-machine-image \
--source-instance kafka-broker-01 \
--source-instance-zone asia-south1-a

Let’s now create kafka-broker-02, kafka-broker-03 VMs in asia-south1-b, asia-south1-c zones respectively.
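Along these lines; if your gcloud version does not accept an address override together with --source-machine-image, create the VMs first and use whatever internal IPs they get in the broker configs:

gcloud beta compute instances create kafka-broker-02 \
  --zone=asia-south1-b \
  --source-machine-image=kafka-machine-image \
  --private-network-ip=10.10.0.21

gcloud beta compute instances create kafka-broker-03 \
  --zone=asia-south1-c \
  --source-machine-image=kafka-machine-image \
  --private-network-ip=10.10.0.22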

Once these VMs are online, all you need to do is SSH to them and change the following properties in the configuration file — vi /data/kafka/config/server.properties

For kafka-broker-02 VM

broker.id=1
listeners=PLAINTEXT://10.10.0.21:9092
advertised.listeners=PLAINTEXT://10.10.0.21:9092

For kafka-broker-03 VM

broker.id=2
listeners=PLAINTEXT://10.10.0.22:9092
advertised.listeners=PLAINTEXT://10.10.0.22:9092

Since we are launching these VMs from an image, and there might be some data left over that belonged to broker 0, we have to clean up the data (log) directories before we start the service.

sudo rm -rf /data/kafka/logs/ && sudo systemctl start kafka

You can now finally verify the status of all brokers from one of the Zookeeper nodes.
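All three broker ids should now show up:

# From any Zookeeper node
cd /opt/zookeeper && bin/zkCli.sh -server 127.0.0.1:2181

[zk: 127.0.0.1:2181(CONNECTED) 0] ls /brokers/ids
[0, 1, 2]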

Congratulations! You have now set up Kafka the hard way on Google Compute Engine. Let’s clean up now.

Cleanup
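Assuming the resource names used throughout this guide (adjust if you picked your own), deleting everything looks roughly like this:

# VMs
gcloud compute instances delete zk-01 kafka-broker-01 --zone=asia-south1-a
gcloud compute instances delete zk-02 kafka-broker-02 --zone=asia-south1-b
gcloud compute instances delete zk-03 kafka-broker-03 --zone=asia-south1-c

# Machine images and reserved addresses
gcloud compute machine-images delete zookeeper-machine-image kafka-machine-image
gcloud compute addresses delete kafka-broker-ip-01 kafka-broker-ip-02 kafka-broker-ip-03 \
  --region=asia-south1

# Firewall rules, subnet, and VPC
gcloud compute firewall-rules delete kafka-internal allow-iap-ssh
gcloud compute networks subnets delete kafka-subnet --region=asia-south1
gcloud compute networks delete kafka-vpc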

Hope you found this guide useful. If you need help running self-managed Kafka at scale on Google Compute Engine, reach out to us! Happy stream processing :)
