Archive for February 2022
Learning Spark 2nd edition
I was searching for a book on Spark on the O’Reilly site and found this one. Luckily, the PDF is available online for free: https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
port forwarding and ubuntu firewall for hadoop
Even with port forwarding in place, I still needed to use Firefox inside the VM to reach the Hadoop web UIs.
The problem was the Ubuntu firewall, so I opened the ports with ufw:
# ufw allow 50075
Rule added
# ufw allow 18080
Rule added
# ufw allow 50070
Rule added
# ufw allow 8042
Rule added
# ufw allow 8088
Rule added
# ufw allow 50090
Rule added
# ufw allow 4040
Rule added
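To confirm from the host that the rules took effect, a small check like the one below can help. It is only a sketch, not something from my original set-up: the names `port_is_open` and `report` are made up for illustration, and it assumes the VM ports are forwarded to the same port numbers on the host's 127.0.0.1. Replace the address if you reach the VM differently.

```python
import socket

# Ports opened with ufw above
PORTS = [50075, 18080, 50070, 8042, 8088, 50090, 4040]


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered by the firewall, or timed out
        return False


def report(host, ports=PORTS):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}


# Assumption: forwarded ports are exposed on the host's loopback address
for port, is_open in report("127.0.0.1").items():
    print(f"{port}: {'open' if is_open else 'closed or filtered'}")
```

A port can also show as closed simply because the service behind it is not running (for example 4040 only answers while a spark-shell session is active), so a closed port does not necessarily point to the firewall.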
hadoop and spark on ubuntu
Hi, I am about to start working on a big data project, so I am setting up my environment.
As usual, the versions to use are quite old: once a project is running, it is difficult to upgrade to the latest versions.
I am quite old too, so I did not think of using Docker images to set things up. When I realized it, I tried to find some Hadoop images, but a quick Google search today did not turn up official ones, so I kept my own environment set-up.
So, an Ubuntu server that runs without a UI:
sudo systemctl set-default multi-user.target
This does the magic. Then I set the DISPLAY environment variable so that MobaXterm serves X11. This way I can run IntelliJ from the box alongside Windows applications. The same goes for gedit, etc.
Then I installed the glorious spark-2.1.0-bin-hadoop2.7 and hadoop-2.0.7. The set-up took a lot of time; luckily there are plenty of guides that go through it step by step.
This one is very nice: https://phoenixnap.com/kb/install-hadoop-ubuntu
For Spark I had to set some environment variables concerning the log location, which took a while.
Now the Spark console, spark-shell, works and I can play a bit.
There are a lot of useful ports to monitor the processes.
Useful ports for hadoop and spark
NameNode: fs.defaultFS is hdfs://localhost:9000
namenode http://localhost:50070/dfshealth.html#tab-overview
secondary namenode http://0.0.0.0:50090
data node http://0.0.0.0:50075
yarn resource manager port 8088
yarn Node manager http://localhost:8042/node
HistoryServer http://127.0.0.1:18080
spark shell http://127.0.0.1:4040
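When the UIs are reached from another machine, the host part of these addresses changes while the ports stay the same. The following is a minimal sketch that rebuilds the list for a given host. The dictionary labels and the function name `ui_urls` are my own choices for this example, and where the list above gives no path I assume the root path `/`.

```python
# Ports and paths taken from the list above; labels are chosen for this sketch.
WEB_UIS = {
    "namenode": (50070, "/dfshealth.html#tab-overview"),
    "secondary namenode": (50090, "/"),
    "data node": (50075, "/"),
    "yarn resource manager": (8088, "/"),
    "yarn node manager": (8042, "/node"),
    "history server": (18080, "/"),
    "spark shell": (4040, "/"),
}


def ui_urls(host="localhost"):
    """Return a label -> URL mapping for the web UIs on the given host."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in WEB_UIS.items()
    }


# Print a cheat sheet; pass the VM address instead of the default if needed
for name, url in ui_urls().items():
    print(f"{name}: {url}")
```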