Thursday 14 May 2020

SparkSql scenarios

This blog collects scenarios and questions asked in recent interviews; hope it helps you crack yours. You don't need a high-end laptop to practice these: Databricks provides a free Community Edition. Keep commenting with the questions you come across, and it will help keep everyone technically up to date.
The data for the queries is available in table format and can be downloaded from here in CSV format.


Question 1: A university decides to grade a student based on the improvement he/she has made relative to the previous year's performance. Each student takes exams in 6 subjects. Below are the rules the university has defined:
1- Marks improved in 5 or 6 subjects compared to last year => A
2- Marks improved in 3 to 5 subjects (ignoring drops in the other subjects) compared to last year => B
3- Marks improved in 1 to 3 subjects (ignoring drops in the other subjects) compared to last year => C
4- Marks decreased in all subjects => D

Considering the marks data to be of the format:
+---------+----+-----------+-----+
|studentID|Year|    Subject|marks|
+---------+----+-----------+-----+
|        1|2010|    english|   60|
|        1|2010|    science|   60|
|        1|2010|    histroy|   60|
|        1|2010|mathematics|   60|
|        1|2010|    kannada|   60|
|        1|2010|     social|   60|
|        2|2010|    english|   90|
|        2|2010|    science|   80|
|        2|2010|    histroy|   70|
|        2|2010|mathematics|   85|
|        2|2010|    kannada|   70|
|        2|2010|     social|   85|
|        3|2010|    english|   60|
|        3|2010|    science|   60|
|        3|2010|    histroy|   60|
|        3|2010|mathematics|  100|
|        3|2010|    kannada|  100|
|        3|2010|     social|  100|
|        4|2010|    english|   60|
|        4|2010|    science|   60|
+---------+----+-----------+-----+

+---------+----+-----------+-----+
|studentID|Year|    Subject|marks|
+---------+----+-----------+-----+
|        1|2011|    english|  100|
|        1|2011|    science|  100|
|        1|2011|    histroy|  100|
|        1|2011|mathematics|  100|
|        1|2011|    kannada|  100|
|        1|2011|     social|  100|
|        2|2011|    english|   30|
|        2|2011|    science|   30|
|        2|2011|    histroy|   30|
|        2|2011|mathematics|   30|
|        2|2011|    kannada|   30|
|        2|2011|     social|   30|
|        3|2011|    english|  100|
|        3|2011|    science|  100|
|        3|2011|    histroy|  100|
|        3|2011|mathematics|  100|
|        3|2011|    kannada|  100|
|        3|2011|     social|  100|
|        4|2011|    english|  100|
|        4|2011|    science|  100|
+---------+----+-----------+-----+


Write a Spark job that can be run in one go to generate the grades of all students.
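Before wiring anything into Spark, the grading rule itself can be sketched as a pure Scala function. The bands below are one reading of the overlapping ranges in the statement (5-6 improved subjects => A, 3-4 => B, 1-2 => C, none => D); in Spark the per-student improvement count can be computed with lag(marks) over a window partitioned by studentID and Subject and ordered by Year, then joined back to the distinct student list so students with no improvements still get a D.

```scala
// Map the number of improved subjects (out of 6) to a grade.
// Assumes the overlapping bands resolve as: 5-6 => A, 3-4 => B, 1-2 => C, 0 => D.
def grade(improvedSubjects: Int): String = improvedSubjects match {
  case n if n >= 5 => "A"
  case n if n >= 3 => "B"
  case n if n >= 1 => "C"
  case _           => "D"
}

// Count subjects whose marks improved year over year for one student,
// given last year's and this year's marks keyed by subject.
def improvedCount(prev: Map[String, Int], curr: Map[String, Int]): Int =
  curr.count { case (subject, marks) => prev.get(subject).exists(marks > _) }
```

A student whose marks went from 60 to 100 in two subjects, for example, has an improvement count of 2 and lands on grade C under this reading.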

Question 2: Given time-series clickstream data of user activity stored in Hive, the task is to enrich the data with a session id.

Session Definition:
·         A session expires after 30 minutes of inactivity; during the inactivity no clickstream records are generated.
·         A session remains active for a total duration of at most 2 hours.
Steps:
·         Load the data into a Hive table.
·         Read the data from Hive and use a Spark batch job (Scala) to do the computation.
·         Save the enriched results in Parquet. Also try the above steps using spark-sql.
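The session rule in the steps above can be sketched as a pure Scala function over one user's click timestamps (epoch seconds, already sorted); in Spark the same logic is typically expressed with lag() plus a running sum over a window partitioned by userid and ordered by timestamp. The 30-minute and 2-hour limits come from the session definition above.

```scala
// Assign a session id to each click: a new session starts on the first
// click, after a gap of more than 30 minutes, or once the current session
// would exceed a total duration of 2 hours.
def sessionIds(timestamps: Seq[Long]): Seq[Int] = {
  val maxGap = 30 * 60L      // 30-minute inactivity limit, in seconds
  val maxLen = 2 * 3600L     // 2-hour session cap, in seconds
  var session = 0
  var prev = Long.MinValue   // previous click's timestamp
  var start = Long.MinValue  // current session's first click
  timestamps.map { t =>
    if (prev == Long.MinValue || t - prev > maxGap || t - start > maxLen) {
      session += 1
      start = t
    }
    prev = t
    session
  }
}
```

For example, clicks at 0s, 600s, and 3000s produce sessions 1, 1, 2: the 40-minute gap before the last click starts a new session.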

Question 3:
In addition to the problem statement given in question 2, assume the scenarios below as well and design a Hive table based on them:
·         Get the number of sessions generated in a day.
·         Get the total time spent by a user in a day.
·         Get the total time spent by a user over a month.
Here are the guidelines and instructions for solving the above queries:
·         Design hive table
·         Write the script to create the table
·         Load data into table
·         Write all the queries in spark-sql
·         Think in the direction of using partitioning, bucketing, etc. 
ClickStream Data:

+--------------------+------+
|           Timestamp|userid|
+--------------------+------+
|2018-01-01T11:00:00Z|    01|
|2018-01-01T12:00:00Z|    01|
|2018-01-01T11:00:00Z|    03|
|2018-01-02T11:00:00Z|    04|
|2018-01-01T12:15:00Z|    01|
+--------------------+------+
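One possible design, assuming clickstream events land daily and the session enrichment from question 2 has already produced a sessionid column: partition by date so the day- and month-level aggregations prune partitions, and bucket by userid to help the per-user queries. All names here (clickstream, event_ts, event_date, sessionid) are illustrative, not fixed by the problem.

```sql
-- Sketch of a Hive table for the enriched clickstream.
CREATE TABLE IF NOT EXISTS clickstream (
  event_ts  TIMESTAMP,
  userid    STRING,
  sessionid STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (userid) INTO 8 BUCKETS
STORED AS PARQUET;

-- Number of sessions generated in a day.
SELECT event_date, COUNT(DISTINCT sessionid) AS sessions
FROM clickstream
GROUP BY event_date;

-- Total time spent by a user in a day
-- (session duration = last click - first click of the session).
SELECT userid, event_date, SUM(session_secs) AS total_secs
FROM (
  SELECT userid, event_date, sessionid,
         UNIX_TIMESTAMP(MAX(event_ts)) - UNIX_TIMESTAMP(MIN(event_ts)) AS session_secs
  FROM clickstream
  GROUP BY userid, event_date, sessionid
) s
GROUP BY userid, event_date;

-- Total time spent by a user over a month.
SELECT userid, SUBSTR(event_date, 1, 7) AS event_month, SUM(session_secs) AS total_secs
FROM (
  SELECT userid, event_date, sessionid,
         UNIX_TIMESTAMP(MAX(event_ts)) - UNIX_TIMESTAMP(MIN(event_ts)) AS session_secs
  FROM clickstream
  GROUP BY userid, event_date, sessionid
) s
GROUP BY userid, SUBSTR(event_date, 1, 7);
```

The string partition column keeps `SUBSTR(event_date, 1, 7)` cheap for the monthly rollup; a separate month partition column would be an alternative if monthly queries dominate.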

Question 4:

Disney’s IoT sensors emit data in the format below: Sensor, Mnemonic, data, and timestamp are the fields emitted by the IoT device. Write Spark code to find the start_time_stamp and end_time_stamp for each Mnemonic of a Sensor whenever the data boolean changes from 1 to 0 or from 0 to 1. The expected output is given below. Hint: use DataFrame window functions with lead and lag.

Sensor Data:
+--------+-------------+----+----------------+
|  Sensor|     Mnemonic|data|       timestamp|
+--------+-------------+----+----------------+
|SensorIO|icATswJogMain|   1|1543273336518117|
|SensorIO|icATswJogMain|   0|1543273336183163|
|SensorIO|icATswJogMain|   0|1543273336200161|
|SensorIO|icATswJogMain|   0|1543273336208166|
|SensorIO|icATswJogMain|   1|1543273336213166|
|SensorIO|icATswJogMain|   0|1543273336185163|
|SensorIO|icATswJogMain|   1|1543273336201168|
|SensorIO|icATswJogMain|   0|1543273336212165|
|SensorIO|icATswJogMain|   0|1543273336277159|
|SensorIO|icATswJogMain|   0|1543273336192166|
|SensorIO|icATswJogMain|   0|1543273336193169|
|SensorIO|icATswJogMain|   0|1543273336207161|
|SensorIO|icATswJogMain|   0|1543273336239163|
|SensorIO|icATswJogMain|   1|1543273336190170|
+--------+-------------+----+----------------+


Expected Output:

+--------+-------------+----+----------------+----------------+
|  SENSOR|     MNEMONIC|DATA|      START_DATE|        END_DATE|
+--------+-------------+----+----------------+----------------+
|SensorIO|icATswJogMain|   1|1543273336518117|1543273336183163|
|SensorIO|icATswJogMain|   0|1543273336208166|1543273336213166|
|SensorIO|icATswJogMain|   1|1543273336213166|1543273336185163|
|SensorIO|icATswJogMain|   0|1543273336185163|1543273336201168|
|SensorIO|icATswJogMain|   1|1543273336201168|1543273336212165|
|SensorIO|icATswJogMain|   0|1543273336239163|1543273336190170|
|SensorIO|icATswJogMain|   1|1543273336190170|            null|
+--------+-------------+----+----------------+----------------+
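The lead-based logic behind that output can be sketched as a pure Scala function: pair each row with the next one, keep only rows whose data differs from the next row's data (a 1-to-0 or 0-to-1 flip) plus the final row, and emit (data, start, end) where end is the next row's timestamp (None for the last row). In a DataFrame this is lead(data) and lead(timestamp) over a window ordered by arrival order; note the sample timestamps are not monotonic, so an explicit ordering column is assumed.

```scala
// Given (data, timestamp) rows in arrival order, return the flip
// boundaries as (data, start_ts, end_ts), with end_ts = None on the
// final row. Mirrors lead(data) / lead(timestamp) in Spark SQL.
def flipBoundaries(rows: Seq[(Int, Long)]): Seq[(Int, Long, Option[Long])] = {
  val nexts = rows.drop(1).map(Some(_))       // lead(...) equivalent
  rows.zipAll(nexts, (0, 0L), None).collect {
    case ((d, t), Some((nd, nt))) if d != nd => (d, t, Some(nt)) // data flips here
    case ((d, t), None)                      => (d, t, None)    // last row
  }
}
```

Running it over the first five sample rows reproduces the first two expected rows plus the open-ended row for the trailing 1.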

Question 5:

Visitors of an eCommerce site browse multiple products during a visit. All visit data for a visitor is consolidated into a JSON document containing the visitor Id and a list of product Ids, along with an interest attribute holding the level of interest the visitor expressed in each product. Here are two example records - rec1 and rec2 - containing the visit data of two visitors, v1 and v2:
 
val rec1: String = """{
    "visitorId": "v1",
    "products": [{
         "id": "i1",
         "interest": 0.68
    }, {
         "id": "i2",
         "interest": 0.42
    }]
}"""
 
val rec2: String = """{
    "visitorId": "v2",
    "products": [{
         "id": "i1",
         "interest": 0.78
    }, {
         "id": "i3",
         "interest": 0.11
    }]
}"""
 
val visitsData: Seq[String] = Seq(rec1, rec2)
 
val productIdToNameMap = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")
 
Given the collection of records (visitsData) and a map (productIdToNameMap) of product Ids and their names:

1. Write the code to enrich every record contained in visitsData with the name of each product. The output should be another sequence with all the original JSON documents enriched with product names. Here is the example output:
 
val output: Seq[String] = Seq(enrichedRec1, enrichedRec2)
 
where enrichedRec1 has value -
"""{
    "visitorId": "v1",
    "products": [{
         "id": "i1",
         "name": "Nike Shoes",
         "interest": 0.68
    }, {
         "id": "i2",
         "name": "Umbrella",
         "interest": 0.42
    }]
}"""
 
And enrichedRec2 has value -
"""{
    "visitorId": "v2",
    "products": [{
         "id": "i1",
         "name": "Nike Shoes",
         "interest": 0.78
    }, {
         "id": "i3",
         "name": "Jeans",
         "interest": 0.11
    }]
}"""
 
2. Please write two sets of code - one using only Scala (no Spark) and another using Spark RDD/DataFrame so that the enrichment happens in parallel. The output of both sets of code should be the same.
3. Include unit tests with your code.
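For part 1 without Spark, here is a minimal sketch that avoids any external JSON library by splicing a name field in after each id field with a regex, assuming the records keep the exact `"id": "...",` formatting shown above. Real code would parse the JSON with a library such as json4s or circe instead of pattern-matching on text; the Spark variant can then parallelize the same function, e.g. sc.parallelize(visitsData).map(enrich(_, productIdToNameMap)).

```scala
object VisitEnricher {
  // Matches an "id" field and captures its value; assumes the field is
  // always followed by a comma, as in the sample records.
  private val idField = """"id":\s*"([^"]+)",""".r

  // Insert a "name" field after every "id" field, looked up in the
  // product map; ids missing from the map are left unchanged.
  def enrich(record: String, names: Map[String, String]): String =
    idField.replaceAllIn(record, m => {
      val id = m.group(1)
      names.get(id) match {
        case Some(name) =>
          s""""id": "$id",
         "name": "$name","""
        case None => m.matched
      }
    })
}
```

The regex approach is brittle (it depends on field order and the trailing comma), which is exactly why the unit tests asked for in part 3 matter here.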


5 comments:

  1. Thanks Sparky. It's very helpful for my interview.

  2. Thanks Sparky
  3. So do you have a solution for question 1? I tried lag/lead, max, min... I didn't get it. Do you have any approach?

    Replies
    1. Hi Shiva,

      Try this approach:
      val student10 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/FileStore/tables/student_2010-28a8b.csv")
      val student11 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/FileStore/tables/student_2011-d4583.csv")
      val combinedStudents = student10.union(student11)
      combinedStudents.createOrReplaceTempView("students")

      val df = spark.sql("""
        SELECT nstudent.studentID,
               CASE WHEN gradeCount >= 5 THEN 'A'
                    WHEN gradeCount >= 3 THEN 'B'
                    WHEN gradeCount >= 1 THEN 'C'
                    ELSE 'D'
               END AS ResultGrade
        FROM (
          -- count the subjects whose marks improved year over year
          SELECT studentID, COUNT(*) AS gradeCount
          FROM (
            SELECT studentID, Subject, Year, marks,
                   marks - oldScores AS differScores
            FROM (
              SELECT studentID, Subject, Year, marks,
                     LAG(marks, 1, 0) OVER (PARTITION BY studentID, Subject ORDER BY Year ASC) AS oldScores
              FROM students
            ) WHERE oldScores > 0
          ) d WHERE differScores > 0
          GROUP BY studentID
        ) cstudent
        -- keep students with no improvements so they get grade 'D'
        RIGHT OUTER JOIN (SELECT DISTINCT studentID FROM students) nstudent
          ON cstudent.studentID = nstudent.studentID""")

      df.show()

  4. Hi, do you have the solution for the third question?