4 Replies - 722 Views - Last Post: 01 July 2012 - 12:16 PM Rate Topic: -----

#1 raihan26  Icon User is offline

  • New D.I.C Head

Reputation: -1
  • View blog
  • Posts: 47
  • Joined: 04-February 10

Query Against Two Related Tables (Joins)

Posted 01 July 2012 - 11:56 AM

This is one table in Hive- It contains information about the items we are purchasing.

    CREATE EXTERNAL TABLE IF NOT EXISTS Table1
    (
    UID BIGINT,
    ITEMS_PURCHASED ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
    )


And this is the data in the above table-

    1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"},{"product_id":140002997245,"timestamps":"1339694926000"},{"product_id":200002448035,"timestamps":"1339172659000"},{"product_id":260003553381,"timestamps":"1339072514000"}]


This is the second table in Hive- It also contains information about the item we are purchasing.

    CREATE EXTERNAL TABLE IF NOT EXISTS Table2
    (
    ITEM_ID BIGINT,
    CREATED_TIME STRING,
    BUYER_ID BIGINT
    )


And this is the data in the above second table

    220003038067   2012-06-21  1015826235
    300003861266   2012-06-21  1015826235
    140002997245   2012-06-14  1015826235
    200002448035   2012-06-08  1015826235
    260003553381   2012-06-07  1015826235


Problem Statement-
**We need to compare the above two tables basis on UID( and BUYER_ID). As UID in one table (Table1) and BUYER_ID in second table (Table2), they both are same thing. So I need to see if UID and BUYER_ID gets matched, then ITEMS_PURCHASED in Table1 table should be same as ITEM_ID and CREATED_TIME in Table2 table and if they (means ITEMS_PURCHASED and ITEM_ID, CREATED_TIME) are not same, I need to do something,So Basically I need to generate a report if they gets matched or not matched, means data accuracy report, like this much percentage data is accurate and this much percentage it is not, kind of statistical analysis**

So just to make it more clear-

**ITEMS_PURCHASED is an array of Struct in Table1 table and it contains two things PRODUCT_ID and TIMESTAMPS.
And if UID and BUYER_ID gets matched then PRODUCT_ID in Table1 should be matched with ITEM_ID in Table2 and TIMESTAMPS in Table1 should be matched with CREATED_TIME in Table2.**


And one more thing these tables have millions of data in them. I have reduced it to only one record to simplify the problem so how I can do this problem efficiently.
I think I need to write some MapReduce job for this. And this is the first time I am working with Hive, Hadoop and Map Reduce. So that is the reason I am facing a lot of problem.

I was thinking two solutions-
1) check on millions of data by comparing user id's and buyer_id

2) or sample some UID and buyer_id then compare the data.

3) Any other approach?



Any suggestions will be appreciated

This post has been edited by macosxnerd101: 01 July 2012 - 12:02 PM
Reason for edit:: Renamed title to be more descriptive


Is This A Good Question/Topic? 0
  • +

Replies To: Query Against Two Related Tables (Joins)

#2 macosxnerd101  Icon User is online

  • Self-Trained Economist
  • member icon




Reputation: 10369
  • View blog
  • Posts: 38,408
  • Joined: 27-December 08

Re: Query Against Two Related Tables (Joins)

Posted 01 July 2012 - 12:02 PM

Start by normalizing your Table1. Using Collections isn't a good idea, as it makes it hard to query the database. For Table1, I would use the structure:
uid (PK) | product_id (FK) | timestamp (FK)



And it sounds like you are trying to query, using a join.

Since this is a database question, not a Java question, I'll move it to Databases.
Was This Post Helpful? 0
  • +
  • -

#3 raihan26  Icon User is offline

  • New D.I.C Head

Reputation: -1
  • View blog
  • Posts: 47
  • Joined: 04-February 10

Re: Query Against Two Related Tables (Joins)

Posted 01 July 2012 - 12:06 PM

View Postmacosxnerd101, on 01 July 2012 - 12:02 PM, said:

Start by normalizing your Table1. Using Collections isn't a good idea, as it makes it hard to query the database. For Table1, I would use the structure:
uid (PK) | product_id (FK) | timestamp (FK)



And it sounds like you are trying to query, using a join.

Since this is a database question, not a Java question, I'll move it to Databases.


Thanks for your comment. I cannot normalize any of these tables as these tables has to be in this format only and also I have removed lot of other fields from the tables just to make question more clear, it has more than 15 columns in both of the tables. As I am working on the project in which my team mates has asked us to use these two tables only and it was there since long time, so I cannot do anything in that table. SO that means I need to device some other efficient approach. If you are saying JOINS, then joins are not too costly?

This post has been edited by raihan26: 01 July 2012 - 12:07 PM

Was This Post Helpful? 0
  • +
  • -

#4 macosxnerd101  Icon User is online

  • Self-Trained Economist
  • member icon




Reputation: 10369
  • View blog
  • Posts: 38,408
  • Joined: 27-December 08

Re: Query Against Two Related Tables (Joins)

Posted 01 July 2012 - 12:11 PM

Quote

I cannot normalize any of these tables as these tables has to be in this format only

Refactoring in this case would make everyone's lives easier. A join is really only to be used in a normalized relation.

Quote

As I am working on the project in which my team mates has asked us to use these two tables only and it was there since long time, so I cannot do anything in that table. SO that means I need to device some other efficient approach.

Seeing when something isn't working and refactoring it is an important skill in programming. You and your teammates are approaching this problem the wrong way. A database table is a giant collection. Using collections within the collection is not a good idea. Linking tables and joins are the appropriate relational way to attack the problem. Anything else will create brittle, unmaintainable, and in-extensible solutions.
Was This Post Helpful? 0
  • +
  • -

#5 raihan26  Icon User is offline

  • New D.I.C Head

Reputation: -1
  • View blog
  • Posts: 47
  • Joined: 04-February 10

Re: Query Against Two Related Tables (Joins)

Posted 01 July 2012 - 12:16 PM

View Postmacosxnerd101, on 01 July 2012 - 12:11 PM, said:

Quote

I cannot normalize any of these tables as these tables has to be in this format only

Refactoring in this case would make everyone's lives easier. A join is really only to be used in a normalized relation.

Quote

As I am working on the project in which my team mates has asked us to use these two tables only and it was there since long time, so I cannot do anything in that table. SO that means I need to device some other efficient approach.

Seeing when something isn't working and refactoring it is an important skill in programming. You and your teammates are approaching this problem the wrong way. A database table is a giant collection. Using collections within the collection is not a good idea. Linking tables and joins are the appropriate relational way to attack the problem. Anything else will create brittle, unmaintainable, and in-extensible solutions.



ummm, As I said, I don't think so I can ask my team mates to normalize the tables as this table format is there since long time. And as you said JOINS are the only option on millions of data?
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1