CREATE EXTERNAL TABLE IF NOT EXISTS Table1
(
UID BIGINT,
ITEMS_PURCHASED ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)
This is the data in the above table:
1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"},{"product_id":140002997245,"timestamps":"1339694926000"},{"product_id":200002448035,"timestamps":"1339172659000"},{"product_id":260003553381,"timestamps":"1339072514000"}]
This is the second table in Hive. It also contains information about the items being purchased.
CREATE EXTERNAL TABLE IF NOT EXISTS Table2
(
ITEM_ID BIGINT,
CREATED_TIME STRING,
BUYER_ID BIGINT
)
This is the data in the second table:
220003038067 2012-06-21 1015826235
300003861266 2012-06-21 1015826235
140002997245 2012-06-14 1015826235
200002448035 2012-06-08 1015826235
260003553381 2012-06-07 1015826235
Problem Statement:
**We need to compare the two tables above on UID and BUYER_ID. UID in Table1 and BUYER_ID in Table2 are the same thing. Where UID and BUYER_ID match, ITEMS_PURCHASED in Table1 should agree with ITEM_ID and CREATED_TIME in Table2, and where they do not agree, I need to do something about it. Basically, I need to generate a data accuracy report on whether they match or not: what percentage of the data is accurate and what percentage is not, a kind of statistical analysis.**
Just to make it clearer:
**ITEMS_PURCHASED in Table1 is an array of structs, and each struct contains two fields, PRODUCT_ID and TIMESTAMPS.
If UID and BUYER_ID match, then PRODUCT_ID in Table1 should match ITEM_ID in Table2, and TIMESTAMPS in Table1 should match CREATED_TIME in Table2.**
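To make the matching rule concrete, here is a rough Python sketch of it run on the sample rows above. It is only an illustration of the logic, not the Hive or MapReduce job itself. It assumes that TIMESTAMPS holds epoch milliseconds and that CREATED_TIME is the corresponding calendar date in UTC; the names `to_utc_date` and `accuracy_report` are made up for this example.

```python
from datetime import datetime, timezone

# Sample rows copied from the post. Table1 rows are (uid, items_purchased).
table1 = [
    (1015826235, [
        {"product_id": 220003038067, "timestamps": "1340321132000"},
        {"product_id": 300003861266, "timestamps": "1340271857000"},
        {"product_id": 140002997245, "timestamps": "1339694926000"},
        {"product_id": 200002448035, "timestamps": "1339172659000"},
        {"product_id": 260003553381, "timestamps": "1339072514000"},
    ]),
]

# Table2 rows are (item_id, created_time, buyer_id).
table2 = [
    (220003038067, "2012-06-21", 1015826235),
    (300003861266, "2012-06-21", 1015826235),
    (140002997245, "2012-06-14", 1015826235),
    (200002448035, "2012-06-08", 1015826235),
    (260003553381, "2012-06-07", 1015826235),
]


def to_utc_date(millis_str):
    """Assumption: TIMESTAMPS is epoch milliseconds, compared as a UTC date."""
    seconds = int(millis_str) // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def accuracy_report(table1, table2):
    # Flatten the array of structs so every purchase becomes one key:
    # (user id, item id, date).
    left = {
        (uid, item["product_id"], to_utc_date(item["timestamps"]))
        for uid, items in table1
        for item in items
    }
    right = {
        (buyer_id, item_id, created_time)
        for item_id, created_time, buyer_id in table2
    }
    matched = left & right
    total = len(left | right)  # every distinct purchase seen on either side
    return {
        "matched": len(matched),
        "only_in_table1": len(left - right),
        "only_in_table2": len(right - left),
        "accuracy_pct": 100.0 * len(matched) / total if total else 100.0,
    }


report = accuracy_report(table1, table2)
print(report)
```

If CREATED_TIME is actually stored in a local time zone rather than UTC, purchases made close to midnight would show up as mismatches, so the conversion in `to_utc_date` would need to be adjusted.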
One more thing: these tables contain millions of records. I have reduced the data to only one record to simplify the problem. How can I do this efficiently?
I think I need to write a MapReduce job for this. This is the first time I am working with Hive, Hadoop and MapReduce, which is why I am running into a lot of problems.
I was thinking of two solutions:
1) Check all of the millions of records by comparing UID and BUYER_ID.
2) Sample some UIDs and BUYER_IDs, then compare the data.
Is there any other approach?
Any suggestions will be appreciated.
This post has been edited by macosxnerd101: 01 July 2012 - 12:02 PM
Reason for edit:: Renamed title to be more descriptive
