spark-shell
This page is a note to remember a first contact with the spark-shell and the basic commands to use.
scala> spark
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4e0139
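The shell already binds the session to spark and the underlying SparkContext to sc, so nothing has to be constructed by hand; a quick sanity check looks like this (outputs omitted):
scala> spark.version
scala> sc.master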
scala> val flx = spark.read.parquet("file:///home/me/flx.parquet")
scala> flx.columns
res1: Array[String] = Array(MessageID,...
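columns only gives the names; to also see the column types, printSchema is handy:
scala> flx.printSchema()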
scala> flx.select("MessageID").show(10,false)
scala> import org.apache.spark.sql.functions._
scala> flx.select("MessageID","B").where(col("MessageID") !== col("B")).show(5,false)
scala> flx.select("X").where(col("X")==="").count()
res20: Long = 13066
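Column expressions compose with the usual boolean operators, so slightly richer conditions still stay in the DataFrame API, for example treating empty strings and nulls together:
scala> flx.select("MessageID","X").where(col("X") === "" || col("X").isNull).show(5,false)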
But if you want to use custom logic in the where clause, it seems you have to drop down to the RDD API:
val myfile=spark.read.parquet("file:///myfile.parquet")
import org.apache.spark.sql.functions._
import scala.util.parsing.json.JSON
myfile.rdd.filter( row => JSON.parseFull(row.getString(4)).isEmpty).count()
The above code counts how many malformed JSON values there are in column 4… maybe there is a way to use column names instead of numbers, but this is to be continued.
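As a follow-up sketch (not verified here): Row.getAs lets the filter address the field by name instead of position, and a udf keeps the same custom check inside the DataFrame API. The column name "jsonCol" below is only a placeholder for whatever column actually holds the JSON string.

import org.apache.spark.sql.functions.{col, udf}
import scala.util.parsing.json.JSON

// same filter as above, but addressing the column by name instead of index 4
myfile.rdd.filter(row => JSON.parseFull(row.getAs[String]("jsonCol")).isEmpty).count()

// or stay in the DataFrame API by wrapping the custom logic in a UDF
val isMalformed = udf((s: String) => s == null || JSON.parseFull(s).isEmpty)
myfile.where(isMalformed(col("jsonCol"))).count()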