Giovanni Bricconi

My site on WordPress.com

Archive for April 2022

Remote environments, my bad


Yep, working locally with a virtual machine. The same old story: it’s better to develop on Linux, but you need all the funny Windows tools, so I have a virtual machine running all day.

This is not a real problem; everything works well, and with an X11 terminal it all looks pretty nice.

The problem comes when you need complicated stuff that you cannot get on your virtual machine, and you have to run tests on a remote environment.

You need a VPN to connect, and you need to pass through a jump server to reach the environment, all for security. Of course you do not want to expose many services to the outside, but it all becomes quite annoying.

You need a new shell, a new login, the password, the second-factor authentication; then you are in and you do your stuff. I never had the occasion to try, but I think certificates could help here, leaving only the pain of entering the second factor before continuing to work. That is not the case right now, so I have to go through all of it.

Then the development environment is local, and I compile some files locally, in the VM. The connection is quite slow, so I try to move just the minimum to the remote environment. That means another scp, with a lot of passwords to type.
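A sketch of what could cut down the password typing, assuming OpenSSH on the VM; the host aliases `jump` and `staging` and both hostnames are made up, adapt them to the real setup:

```
# ~/.ssh/config  -- hypothetical hosts, adapt names and users
Host jump
    HostName jump.example.com
    User giovanni

Host staging
    HostName staging.internal.example.com
    User giovanni
    ProxyJump jump             # tunnel through the jump server transparently
    ControlMaster auto         # reuse one authenticated connection...
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 10m         # ...so later ssh/scp runs skip password and 2FA
```

After the first login, something like `scp target/out.jar staging:/tmp/` rides the already-open connection instead of asking for everything again.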

Then I run my test remotely, fix something, and try again. Of course I am speaking of development/staging environments.

One thing I would like to try is to have the full development environment as a pod in Kubernetes, right next to the staging environment. That way I could connect to it once and just use the UI from there. It would really simplify my life.
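A minimal sketch of that idea, as a long-lived pod in the same cluster as staging; every name here (`dev-box`, the namespace, the image, the PVC) is hypothetical:

```yaml
# dev-box.yaml -- hypothetical dev pod living next to the staging workloads
apiVersion: v1
kind: Pod
metadata:
  name: dev-box
  namespace: staging
spec:
  containers:
  - name: dev
    image: ubuntu:22.04            # or an image with the toolchain baked in
    command: ["sleep", "infinity"] # keep the pod alive for interactive use
    volumeMounts:
    - name: workspace
      mountPath: /home/dev
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: dev-workspace     # hypothetical PVC so work survives restarts
```

Then a single `kubectl exec -it dev-box -n staging -- bash` replaces the whole VPN/jump/scp dance, with the staging services reachable over the cluster network.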

In the end I am already working on a virtual machine; it does not change much for me whether it runs here on my laptop or far, far away in a data center.

Written by Giovanni

April 26, 2022 at 3:58 pm

Posted in Varie

spark-shell


This page is to remember my first contact with the Spark shell and the basic commands to use.

scala> spark
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4e0139
scala> val flx = spark.read.parquet("file:///home/me/flx.parquet")
scala> flx.columns
res1: Array[String] = Array(MessageID,...
scala> flx.select("MessageID").show(10,false)
scala> import org.apache.spark.sql.functions._
scala> flx.select("MessageID","B").where(col("MessageID") =!= col("B")).show(5,false)
scala> flx.select("X").where(col("X")==="").count()
res20: Long = 13066

But if you want custom logic in the where clause, it seems you have to go back to the RDD:

val myfile = spark.read.parquet("file:///myfile.parquet")
import org.apache.spark.sql.functions._
import scala.util.parsing.json.JSON
// keep only the rows whose 5th field (index 4) fails to parse as JSON
myfile.rdd.filter(row => JSON.parseFull(row.getString(4)).isEmpty).count()

The above code counts how many malformed JSON values you have in column 4… maybe there is a way to use column names instead of numbers, but this is to be continued.
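A possible follow-up on the column-name question: the DataFrame schema can resolve a name to its position, so the index is not hard-coded. This is a sketch for the spark-shell; `jsonCol` is a made-up column name, while `StructType.fieldIndex` and `Row.fieldIndex` are standard Spark API:

```scala
import scala.util.parsing.json.JSON

val myfile = spark.read.parquet("file:///myfile.parquet")
// resolve the column name to its position once, up front;
// "jsonCol" is hypothetical, put the real column name here
val idx = myfile.schema.fieldIndex("jsonCol")
myfile.rdd.filter(row => JSON.parseFull(row.getString(idx)).isEmpty).count()
```

Inside the filter, `row.fieldIndex("jsonCol")` would also work, but it repeats the lookup for every row.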

Written by Giovanni

April 1, 2022 at 2:13 pm

Posted in Varie