Ranco

Aug 01, 2024 at 23:58

I was once interviewed by a data engineer and was asked to solve an interesting question where to find all the sales records that has previously been sold in the past 5 years. Suppose your dataset looks like this: HouseID | year_sold 1 | 2024 1 | 2020 1 | 2011 2 | 2023 2 | 2017 2 | 2015 3 | 2022 3 | 2019 3 | 2011 For example, the first row 1 | 2024 should be kept as 4 year ago in 2020 it was sold however the second row 1 | 2020 should be dropped as the earlier sale was back in 2011 which is 9 years ago. In the interview, I tried to solve it using same table join which does the job but is probably not very efficient. Then the interviewer asked me if window function can be applied here to which i answered yes as I was reading a book that talks about window function and grouping sets so i had a vague impression that a window function should also solve this. To my surprise, the interviewer corrected me saying no a window function doesn't apply here which I didn't bother giving another thought and simply admitted my "wrong" answer. Just now when I was reading the same book, this little interview twist suddenly popped into my mind so i quickly recreated the dataset and tackled this same question again and viola! I cracked it! Note the following are Postgresql implementation Solution 1: same table join SELECT * FROM (SELECT h1.*,h2."HouseID" AS "HouseID2",h2.year_sold AS year2 FROM houses h1 JOIN houses h2 ON h1."HouseID"=h2."HouseID") AS cartesian WHERE "year_sold"-year2>0 AND "year_sold"-year2<=5; Solution 2: windowing function SELECT "HouseID","year_sold",lag(lower_bound,1) OVER (PARTITION BY "HouseID") AS "minimum_year_threshold_for_previous_sale" FROM ( SELECT "HouseID","year_sold",max("year_sold"-5) OVER (PARTITION BY "HouseID","year_sold" ORDER BY "HouseID","year_sold" desc) AS "lower_bound" FROM houses) AS sub;