Caching Proxy - Installation and Configuration
Setting up a Hadoop cluster is fairly easy with a bit of familiarity with system and network administration. It's all interesting; the only frustrating part is downloading the patches after installing the OS, and then the packages for the software that runs on top of the OS. The downloads can come close to a GB, which can take anywhere from a couple of minutes to hours depending on the internet bandwidth.
This is where caching tools really help. They cache the downloaded packages on a designated local machine (let's call it the cache server), and the other machines point to the cache server to get the packages. This way each package is downloaded from the internet only the first time it is requested; from then on it is served from the local cache server. This approach not only saves network bandwidth but also makes the whole installation process faster.
For Debian systems, apt-cacher-ng is designed to cache packages and is really easy to install and configure. Here are the steps involved:
a) On the cache machine, install the apt-cacher-ng package. Root privileges are required for the installation.
b) All the other machines in the local network have to point to the cache server using the below command, where `cacheserver` has to be replaced with the appropriate host name/IP of the cache machine.
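The two steps above can be sketched as follows. For step a), the package is named `apt-cacher-ng` on Debian and Ubuntu, so the install is typically `sudo apt-get install apt-cacher-ng`. For step b), one common approach (an assumption here, since there are other ways to set an APT proxy) is a one-line file under /etc/apt/apt.conf.d/ on each client. The file name `01proxy` is an arbitrary choice, and `cacheserver` is the placeholder from the text:

```
// /etc/apt/apt.conf.d/01proxy (the file name is an arbitrary choice)
// Replace cacheserver with the host name or IP of the cache machine.
Acquire::http::Proxy "http://cacheserver:3142";
```

Running `sudo apt-get update` on a client afterwards should then go through the cache server.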
It's as easy as the above two steps to set up a cache server for a Debian system.
For an RPM-based system it's a bit more complicated. Squid has to be installed on either a Debian or an RPM-based machine, and the other systems then fetch their packages through squid. Below are the instructions for installing squid on a Debian-based system.
a) On the cache machine, install squid with the package manager of the Debian-based system. Again, root privileges are required for the installation.
b) In the /etc/squid3/squid.conf file, uncomment the cache_dir line. By default squid caches only in memory; with cache_dir enabled, all the packages are stored in the /var/spool/squid3 directory.
Uncomment the below line
and add the below lines to enable access to the squid server from the other machines. The IP addresses have to be chosen appropriately for the network settings/configuration.
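As a rough sketch of what the squid.conf changes in step b) might look like, assuming squid was installed from the `squid3` package (for example with `sudo apt-get install squid3`). The subnet is hypothetical, and treating `http_access allow localnet` as the second line to uncomment is an assumption about the stock configuration file:

```
# Store cached objects on disk instead of only in memory.
# Format: cache_dir ufs <directory> <size in MB> <first-level dirs> <second-level dirs>
# 100 MB is the size in the stock commented-out line; raise it to fit the packages.
cache_dir ufs /var/spool/squid3 100 16 256

# Hypothetical subnet: replace it with the range the client machines actually use.
acl localnet src 192.168.1.0/24
http_access allow localnet
```

The `http_access allow localnet` rule has to appear before the final `http_access deny all` line, because squid applies the first rule that matches. Squid needs a restart to pick up the changes.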
c) Add the below to .bashrc for the proxy to take effect for all applications.
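A minimal sketch of the .bashrc entry, assuming squid listens on its default port 3128 and using `cacheserver` as a placeholder for the host name or IP of the cache machine. Only programs that honor the `http_proxy` environment variable will pick it up:

```shell
# Placeholder host: replace cacheserver with the cache machine's host name or IP.
# 3128 is the port squid listens on by default.
export http_proxy="http://cacheserver:3128"
```

The setting applies to new shells, or to the current one after running `source ~/.bashrc`.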
Alternatively, add the below to /etc/yum.conf so that the proxy is used only by yum, which installs the packages on RPM-based systems. The yum.conf documentation covers this option.
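For the yum-only variant, the entry could look like the following sketch, again with `cacheserver` as a placeholder and squid assumed to be on port 3128:

```
# Goes in the existing [main] section of /etc/yum.conf.
# Replace cacheserver with the host name or IP of the cache machine.
proxy=http://cacheserver:3128
```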
Make sure the firewall has been disabled or the appropriate port has been opened on the cache server: 3142 for apt-cacher-ng and 3128 for squid. On Ubuntu, gufw is a graphical front-end to ufw, which in turn manages iptables.
Squid is highly customizable and can easily be tweaked for performance.
Happy Hadooping !!!