spark-shell
This page is a note to remember a first contact with the spark-shell and the basic commands to use.
scala> spark
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4e0139
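The shell already binds the session to spark and the underlying SparkContext to sc, so nothing has to be constructed by hand; a quick sanity check looks like this (outputs omitted):
scala> spark.version
scala> sc.master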
scala> val flx = spark.read.parquet("file:///home/me/flx.parquet")
scala> flx.columns
res1: Array[String] = Array(MessageID,...
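columns only gives the names; to also see the column types, printSchema is handy:
scala> flx.printSchema()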
scala> flx.select("MessageID").show(10,false)
scala> import org.apache.spark.sql.functions._
scala> flx.select("MessageID","B").where(col("MessageID") !== col("B")).show(5,false)
scala> flx.select("X").where(col("X")==="").count()
res20: Long = 13066
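Column expressions compose with the usual boolean operators, so slightly richer conditions still stay in the DataFrame API, for example treating empty strings and nulls together:
scala> flx.select("MessageID","X").where(col("X") === "" || col("X").isNull).show(5,false)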
But if you want to use custom logic in the where clause, it seems you have to drop down to the RDD API:
val myfile=spark.read.parquet("file:///myfile.parquet")
import org.apache.spark.sql.functions._
import scala.util.parsing.json.JSON
myfile.rdd.filter( row => JSON.parseFull(row.getString(4)).isEmpty).count()
The above code counts how many malformed JSON values there are in column 4… maybe there is a way to use column names instead of numbers, but this is to be continued.
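As a follow-up sketch (not verified here): Row.getAs lets the filter address the field by name instead of position, and a udf keeps the same custom check inside the DataFrame API. The column name "jsonCol" below is only a placeholder for whatever column actually holds the JSON string.

import org.apache.spark.sql.functions.{col, udf}
import scala.util.parsing.json.JSON

// same filter as above, but addressing the column by name instead of index 4
myfile.rdd.filter(row => JSON.parseFull(row.getAs[String]("jsonCol")).isEmpty).count()

// or stay in the DataFrame API by wrapping the custom logic in a UDF
val isMalformed = udf((s: String) => s == null || JSON.parseFull(s).isEmpty)
myfile.where(isMalformed(col("jsonCol"))).count()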