https://developer.ibm.com/recipes/tutorials/building-a-hadoop-cluster-with-raspberry-pi/
Ingredients
- Raspberry Pi 2 (two minimum)
- Micro SD cards, 8GB minimum (one for each Raspberry Pi)
- Wireless adapters or means to build a wired network
Step-by-step
Introduction
Hadoop has great potential and is one of the best known projects for big data. In this tutorial, we will install and configure a Hadoop cluster using Raspberries. Our cluster will consists on twelve nodes (one master and eleven slaves).
Since Raspberries have a low price and Hadoop is open source, together they offer a great opportunity for anybody who wants to start working with big data. We will be configuring the cluster from a Linux workstation, we will be working exclusively with Linux commands. The cluster can be built with other families of operating systems, and the general steps in this tutorial would remain to be useful, but the specific commands would be different.
Download and install an Operating System into the micro SD cards
Any OS that is compatible with Hadoop will work. In our case, we selected Raspbian Jessy Lite as our OS. It is a lightweight OS, designed specifically for Raspberries. It can be downloaded from the official Raspberry Pi Foundation website: https://www.raspberrypi.org/downloads/raspbian/
Once we have downloaded our .img file, we can install it into a micro SD card. This process will overwrite any data that the SD card may have, so it is recommended to use an empty card.
There are many ways to install the OS into the SD card. We will use the dd Linux command:
sudo dd if=2016-02-26-raspbian-jessie-lite.img of=/dev/sddThe if parameter is the image file that we downloaded, The of parameter is the mount point of our micro SD card.
We will repeat this installation in every micro SD card.
Once that is done, you can insert the micro SD cards into the Raspberries and login into them via SSH as the default user “pi” using the password “raspberry”.
You may need to expand the filesystem in order to use the full capacity of the micro SD cards. If this is the case, login into the Raspberries with SSH and run:
sudo raspi-configThen select “Expand Filesystem”. After this is done, you will need to reboot the Raspberry. You can do it like this:
shutdown -r now
Create the installation and HDFS directories
We need a installation directory for Hadoop. We also need a directory where the HDFS (Hadoop Distributed File System) will place its files. In our example, we selected /opt/hadoop/ as the installation directory and /opt/hadoop_tmp/ as the path that will be used by HDFS. So let’s create those directories
sudo mkdir /opt/hadoop/sudo mkdir /opt/hadoop_tmp/Again, make sure to do this in all the Raspberries. Take note of the HDFS directory, we will need it later while editing the configuration files.
Install Java
Since Hadoop requires it, Java needs to be installed in all the nodes. Hadoop 2.7 and later require Java 7. Hadoop 2.6 still supports Java 6. Install the version compatible with your desired Hadoop environment. In Raspbian, to install Oracle Java 8 we would do this:
sudo apt-get updatesudo apt-get install oracle-java8-jdkVerify that it was installed correctly:
java -versionjava version “1.8.0_65”Java(TM) SE Runtime Environment (build 1.8.0_65-b17)Java HotSpot(TM) Client VM (build 25.65-b01, mixed mode)
Setup Connectivity
Depending on your environment, this step could vary. If you are building the cluster on a local wifi network you will need a wireless adapter for each Raspberry (unless you are using Raspberry Pi 3, which has integrated wireless). Otherwise you may need to setup a wired network. What we need is that every Raspberry has its own IP address and that they can connect to each other without problems. This tutorial will not cover how to setup wireless connectivity in the Raspberries, but this is easy to do and there are many tutorials explaining how to do it. The official documentation is a good example: https://www.raspberrypi.org/documentation/configuration/wireless/wireless-cli.md
To make our job easier, we will use an alias for each node from now on. Make a list of all the IP addressess and add them to the /etc/hosts file in every Raspberry. In our case, it looks like this:
9.42.157.175 master9.42.157.182 slave-019.42.157.186 slave-029.42.157.187 slave-039.42.157.194 slave-049.42.157.204 slave-059.42.157.210 slave-069.42.157.233 slave-079.42.157.237 slave-089.42.157.239 slave-099.42.157.241 slave-109.42.157.242 slave-11You can designate any Raspberry as the master node, just make sure to have a proper list in order to avoid confusion.
Generate and replicate SSH keys
We need all the Raspberies to be able to communicate between each other without prompting for passwords. For that, we will generate SSH keys in all of them.
ssh-keygen -t rsa -b 4096 -C your_email@example.comAnd then replicate the keys to all the Raspberies.
ssh-copy-id user@masterssh-copy-id user@slave-01ssh-copy-id user@slave-02ssh-copy-id user@slave-03ssh-copy-id user@slave-04ssh-copy-id user@slave-05ssh-copy-id user@slave-06ssh-copy-id user@slave-07ssh-copy-id user@slave-08ssh-copy-id user@slave-09ssh-copy-id user@slave-10ssh-copy-id user@slave-11After you are done, every Raspberry should have all the keys. This is the step where most people make mistakes, and most of the problems when starting the cluster come from an erroneous configuration of the SSH keys. So take your time to verify that all the nodes can connect to each other.
Add environment variables
Now, we will add some Hadoop environment variables to our ~/.bashrc file. Change the paths according to your environment.
# — HADOOP ENVIRONMENT VARIABLES START — #export HADOOP_HOME=/opt/hadoop/hadoopexport PATH=$PATH:$HADOOP_HOME/binexport PATH=$PATH:$HADOOP_HOME/sbinexport HADOOP_MAPRED_HOME=$HADOOP_HOMEexport HADOOP_COMMON_HOME=$HADOOP_HOMEexport HADOOP_HDFS_HOME=$HADOOP_HOMEexport YARN_HOME=$HADOOP_HOMEexport HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/nativeexport HADOOP_OPTS=“-Djava.library.path=$HADOOP_HOME/lib”# — HADOOP ENVIRONMENT VARIABLES END — #export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jreIf you installed Hadoop in a different path, make sure to change HADOOP_HOME. Make sure to add JAVA_HOME as well, which may be different in your environment.
Install Hadoop in the namenode
The namenode is our master node. The slave nodes are refered as datanodes. We will install Hadoop on our namenode and perform some configurations, then, we will work on the datanodes. Since this is the basic configuration, this and the following steps will be done only in the namenode. Then we will copy this basic configuration to the remaining 11 nodes, and then we will perform specific customizations.
In the releases page (https://hadoop.apache.org/releases.html) of the Hadoop site you can see which is the newest Hadoop version.
Go to the installation directory that we created in the second step. Select a Hadoop version from the download page and get the URL of the tarball. The following commands will download a Hadoop package and uncompress it.
cd /opt/hadoop/
sudo wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
sudo tar xvf hadoop-2.6.4.tar.gz
mv hadoop-2.6.4 hadoop
Add the master and slave files
This step should only be done in the Master node. We must add two files in $HADOOP_HOME/hadoop/etc/hadoop/, one of them will be named master and the other will be named slaves,
In this new master file we only need to add the hostname or alias of the master node. So in our case this file will only contain one line:
masterIn the slaves file we will only add our slaves nodes, one alias per line. In our case:
slave-01slave-02slave-03slave-04slave-05slave-06slave-07slave-08slave-09slave-10slave-11
Setup HDFS
A couple of directories need to be created in our HDFS directory. In the namenode. go to the HDFS directory (/opt/hadoop_tmp/ in our case) and create a directory named “hdfs”, then, inside that directory create another directory named “namenode”.
sudo mkdir /opt/hadoop_tmp/hdfssudo mkdir /opt/hadoop_tmp/hdfs/namenodeNext, we will go to every slave node, and do the same, except that the newest directory will be named “datanode”:
sudo mkdir /opt/hadoop_tmp/hdfssudo mkdir /opt/hadoop_tmp/hdfs/datanode
Copy the basic configuration to the slave nodes
The previous steps for the Hadoop installation were done only in the master node. Now we will take those files and copy them to the remaining Raspberries.
On the master node, run the following commands in order to copy the basic setup to the slaves (you can also do this with scp):
rsync -avxP <$HADOOP_HOME> <hadoop_user>@<slave-hostname>:<$HADOOP_HOME>In our case, it would be something like this:
rsync -avxP /opt/hadoop/ user@slave-01:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-02:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-03:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-04:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-05:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-06:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-07:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-08:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-09:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-10:/opt/hadoop/rsync -avxP /opt/hadoop/ user@slave-11:/opt/hadoop/
Edit the configuration files
There are several XML files that need to be modified according to our architecture. These changes need to be done in the Master and in all the Slave nodes.
Update core-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/core-site.xml), we will change “localhost” to our master node hostname, IP, or alias. In our case, it would be like this:
<configuration><property><name>fs.default.name</name><value>hdfs://master:9000/</value></property><property><name>fs.default.FS</name><value>hdfs://master:9000/</value></property></configuration>Update hdfs-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/hdfs-site.xml), we will change the replication factor to the number of datanodes you have, and specify the datanode and namenodes directories (which we created in the previous step), as well as add an http address:
<configuration><property><name>dfs.datanode.data.dir</name><value>/opt/hadoop_tmp/hdfs/datanode</value><final>true</final></property><property><name>dfs.namenode.name.dir</name><value>/opt/hadoop_tmp/hdfs/namenode</value><final>true</final></property><property><name>dfs.namenode.http-address</name><value>master:50070</value></property><property><name>dfs.replication</name><value>11</value></property></configuration>Update yarn-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/yarn-site.xml). There will be 3 properties that we need to update with our master node hostname or alias:
<configuration><property><name>yarn.resourcemanager.resource-tracker.address</name><value>master:8025</value></property><property><name>yarn.resourcemanager.scheduler.address</name><value>master:8035</value></property><property><name>yarn.resourcemanager.address</name><value>master:8050</value></property></configuration>Update mapred-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/mapred-site.xml), we will add the following properties:
<configuration><property><name>mapreduce.job.tracker</name><value>master:5431</value></property><property><name>mapred.framework.name</name><value>yarn</value></property></configuration>
Format the namenode and start the services
Go to the namenode, in $HADOOP_HOME/bin/ execute the following:
./hdfs namenode -formatThat will format the master as a proper namenode. Finally, start the services. Previously, Hadoop used the start-all.sh script for this, but it has been deprecated and the recommended method now is to use the start-<service>.sh scripts individually. In each node, go to $HADOOP_HOME/sbin/ and execute the following:
./start-yarn.sh./start-dfs.shThe most common problem here is lack connection between the nodes. If you see any errors while starting the nodes, check the SSH keys.
Verify the cluster
There are several ways to confirm that everything is running properly, for example, you can point your browser to http://master:50070 (the http address configured earlier) or to http://master:8088. There you can see status reports of the cluster. Alternatively, you can also check the Java processes in the nodes. In the master node you should see at least 3 processes: NameNode, SecondaryNameNode and ResourceManager. In the slaves you should see only two: NodeManager and DataNode. My favourite way is to use the hdfs report command. This is the output report of my configuration:
/opt/hadoop/hadoop/bin/hdfs dfsadmin -reportConfigured Capacity: 182240477184 (169.72 GB)Present Capacity: 153073315840 (142.56 GB)DFS Remaining: 153073020928 (142.56 GB)DFS Used: 294912 (288 KB)DFS Used%: 0.00%Under replicated blocks: 0Blocks with corrupt replicas: 0Missing blocks: 0-------------------------------------------------Live datanodes (12):Name: 9.42.157.210:50010 (slave-06)Hostname: Pi007-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2411073536 (2.25 GB)DFS Remaining: 12775608320 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.12%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.204:50010 (slave-05)Hostname: Pi006-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2411417600 (2.25 GB)DFS Remaining: 12775264256 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.12%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.237:50010 (slave-08)Hostname: Pi009-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2410917888 (2.25 GB)DFS Remaining: 12775763968 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.12%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.186:50010 (slave-02)Hostname: Pi003-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2411122688 (2.25 GB)DFS Remaining: 12775559168 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.12%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.187:50010 (slave-03)Hostname: Pi004-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2410520576 (2.24 GB)DFS Remaining: 12776161280 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.13%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.175:50010 (master)Hostname: Pi001-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2626191360 (2.45 GB)DFS Remaining: 12560490496 (11.70 GB)DFS Used%: 0.00%DFS Remaining%: 82.71%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.182:50010 (slave-01)Hostname: Pi002-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2409967616 (2.24 GB)DFS Remaining: 12776714240 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.13%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.242:50010 (slave-11)Hostname: Pi012-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2410094592 (2.24 GB)DFS Remaining: 12776587264 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.13%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.241:50010 (slave-10)Hostname: Pi011-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2409664512 (2.24 GB)DFS Remaining: 12777017344 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.13%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.233:50010 (slave-07)Hostname: Pi008-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2410676224 (2.25 GB)DFS Remaining: 12776005632 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.13%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.194:50010 (slave-04)Hostname: Pi005-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2410934272 (2.25 GB)DFS Remaining: 12775747584 (11.90 GB)DFS Used%: 0.00%DFS Remaining%: 84.12%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016Name: 9.42.157.239:50010 (slave-09)Hostname: Pi010-rtpDecommission Status : NormalConfigured Capacity: 15186706432 (14.14 GB)DFS Used: 24576 (24 KB)Non DFS Used: 2434580480 (2.27 GB)DFS Remaining: 12752101376 (11.88 GB)DFS Used%: 0.00%DFS Remaining%: 83.97%Configured Cache Capacity: 0 (0 B)Cache Used: 0 (0 B)Cache Remaining: 0 (0 B)Cache Used%: 100.00%Cache Remaining%: 0.00%Xceivers: 1Last contact: Tue Mar 22 21:10:30 EDT 2016You can see that HDFS takes the space available in each node and combines it as one single file system.
Congratulations! At this point, Hadoop is configured and running. You can stop here, but I would prefer to run a few tests, just to make sure everything is working correctly. We will upload some files to the HDFS.
First, check the space usage:
/opt/hadoop/hadoop/bin/hadoop fs -df -hFilesystem Size Used Available Use%hdfs://master:9000 169.7 G 10.3 G 132.2 G 6%Create a test directory:
/opt/hadoop/hadoop/bin/hadoop fs -mkdir hdfs://master:9000/testdir/I have a test file in the local file system (specifically, in /tmp) we will upload that file into HDFS, inside the directory we just created:
/opt/hadoop/hadoop/bin/hadoop fs -copyFromLocal /tmp/testFile.txt hdfs://master:9000/testdir/testFile.txtWe can list the files in HDFS and confirm that the file was correctly uploaded:
/opt/hadoop/hadoop/bin/hadoop fs -ls hdfs://master:9000/Found 1 itemsdrwxr-xr-x - root supergroup 0 2016-03-23 13:36 hdfs://master:9000/testdir/opt/hadoop/hadoop/bin/hadoop fs -ls hdfs://master:9000/testdir/Found 1 items-rw-r—r— 11 root supergroup 542456 2016-03-23 13:36 hdfs://master:9000/testdir/testFile.txtFor every file and directory in the HDFS, we can also check things like the replication factor, missing replicas, number of racks, file size and many other things. Here, we will check the status of our test directory and our test file:
/opt/hadoop/hadoop/bin/hdfs fsck /testdir/ -files -blocks -racksConnecting to namenode via http://master:50070FSCK started by root (auth:SIMPLE) from /9.42.157.175 for path /testdir/ at Wed Mar 23 13:48:04 EDT 2016/testdir/ <dir>/testdir/testFile.txt 542456 bytes, 1 block(s): OK0. BP-153620240-127.0.1.1-1457972366593:blk_1073741834_1010 len=542456 repl=11[/default-rack/9.42.157.175:50010,/default-rack/9.42.157.204:50010,/default-rack/9.42.157.233:50010,/default-rack/9.42.157.194:50010,/default-rack/9.42.157.239:50010,/default-rack/9.42.157.187:50010,/default-rack/9.42.157.237:50010,/default-rack/9.42.157.182:50010,/default-rack/9.42.157.210:50010,/default-rack/9.42.157.186:50010,/default-rack/9.42.157.242:50010]Status: HEALTHYTotal size: 542456 BTotal dirs: 1Total files: 1Total symlinks: 0Total blocks (validated): 1 (avg. block size 542456 B)Minimally replicated blocks: 1 (100.0 %)Over-replicated blocks: 0 (0.0 %)Under-replicated blocks: 0 (0.0 %)Mis-replicated blocks: 0 (0.0 %)Default replication factor: 11Average block replication: 11.0Corrupt blocks: 0Missing replicas: 0 (0.0 %)Number of data-nodes: 12Number of racks: 1FSCK ended at Wed Mar 23 13:48:04 EDT 2016 in 16 millisecondsThe filesystem under path ‘/testdir/’ is HEALTHYSince this is a very small text file (only 529KB), and our block size is 128MB there is only need for one block to store it. However, if we upload or create a bigger file in HDFS we can see that it needs more blocks, and these blocks are distributed in the cluster following our replication factor parameter. Take, for example, a 1GB file that I created using the Teragen sample program bundled with Hadoop (the following output was abridged for brevity reasons):
/opt/hadoop/hadoop/bin/hdfs fsck /teraInput/ -files -blocks -racksConnecting to namenode via http://master:50070FSCK started by root (auth:SIMPLE) from /9.42.157.175 for path /teraInput/ at Wed Mar 23 13:59:53 EDT 2016/teraInput/ <dir>/teraInput/_SUCCESS 0 bytes, 0 block(s): OK/teraInput/part-m-00000 1000000000 bytes, 8 block(s): OK0. BP-153620240-127.0.1.1-1457972366593:blk_1073741825_1001 len=134217728 repl=11 [/default-rack/9.42.157.175:50010,…]1. BP-153620240-127.0.1.1-1457972366593:blk_1073741826_1002 len=134217728 repl=11 [/default-rack/9.42.157.175:50010,…]2. BP-153620240-127.0.1.1-1457972366593:blk_1073741827_1003 len=134217728 repl=11 [/default-rack/9.42.157.175:50010,…]3. BP-153620240-127.0.1.1-1457972366593:blk_1073741828_1004 len=134217728 repl=11 [/default-rack/9.42.157.175:50010,…]4. BP-153620240-127.0.1.1-1457972366593:blk_1073741829_1005 len=134217728 repl=11 [/default-rack/9.42.157.175:50010,…]5. BP-153620240-127.0.1.1-1457972366593:blk_1073741830_1006 len=134217728 repl=11 [/default-rack/9.42.157.175:50010,…]6. BP-153620240-127.0.1.1-1457972366593:blk_1073741831_1007 len=134217728 repl=11 [/default-rack/9.42.157.175:50010,…]7. BP-153620240-127.0.1.1-1457972366593:blk_1073741832_1008 len=60475904 repl=11 [/default-rack/9.42.157.175:50010,…]Status: HEALTHYTotal size: 1000000000 BTotal dirs: 1Total files: 2Total symlinks: 0Total blocks (validated): 8 (avg. block size 125000000 B)Minimally replicated blocks: 8 (100.0 %)Over-replicated blocks: 0 (0.0 %)Under-replicated blocks: 0 (0.0 %)Mis-replicated blocks: 0 (0.0 %)Default replication factor: 11Average block replication: 11.0Corrupt blocks: 0Missing replicas: 0 (0.0 %)Number of data-nodes: 12Number of racks: 1FSCK ended at Wed Mar 23 13:59:53 EDT 2016 in 13 millisecondsThe filesystem under path ‘/teraInput/’ is HEALTHY
Conclusion
Hadoop is a very powerful tool for big data. In this tutorial we installed and configured it, and also worked a little bit with HDFS. However, Hadoop offers many more possibilities, for example, distributed computing with MapReduce, databases like Hive and many other projects. Also, this is a small test cluster that is only storing small files. Clusters consisting of hundreds of nodes, storing terabytes or petabytes of data, are common. In fact, big clusters like those are where Hadoop and other distributed computing platforms really show how useful they can be.
Installing and configuring Hadoop is not difficult, but we always need to pay very close attention to many aspects, otherwise we will get obscure errors that could be difficult to troubleshoot. Hopefully, this recipe will help those who want to learn about big data technologies.
Doing all this work in the Raspberry Pi platform did not offer any extra difficulty. In fact, it was as easy as working on a “real” server. Obviously, the tiny Raspberries do not offer the same power as more expensive solutions, but building a cluster with Raspberries is much more affordable, and they are generally easier to get, so they are the perfect solution for startups, students, small teams, and projects that do not require high-end hardware.