
Consider the following relation schemas :

• b-Schema = (b-name, b-city, assets)
• a-Schema = (a-num, b-name, bal)
• d-Schema = (c-name, a-number)

Let branch, account and depositor be respectively instances of the above schemas. Assume that account and depositor relations are much bigger than the branch relation.

Consider the following query:

$\prod_{c\_name}(\sigma_{b\_city="Agra" \wedge bal<0}(branch \bowtie (account \bowtie depositor)))$

Which one of the following queries is the most efficient version of the above query ?

1. $\prod_{c\_name}(\sigma_{bal<0}((\sigma_{b\_city="Agra"}\,branch) \bowtie account) \bowtie depositor)$
2. $\prod_{c\_name}((\sigma_{b\_city="Agra"}\,branch) \bowtie (\sigma_{bal<0}\,account \bowtie depositor))$
3. $\prod_{c\_name}(((\sigma_{b\_city="Agra"}\,branch) \bowtie (\sigma_{b\_city="Agra" \wedge bal<0}\,account)) \bowtie depositor)$
4. $\prod_{c\_name}((\sigma_{b\_city="Agra"}\,branch) \bowtie ((\sigma_{b\_city="Agra" \wedge bal<0}\,account) \bowtie depositor))$

0
Does a-number in d-Schema refer to a-num in a-Schema?
0

It must be a typo; it should be a-num, otherwise the natural join would not involve d-Schema at all.

+6
Options (C) and (D) are clearly invalid, as they apply b-city as a selection attribute on the account relation, but b-city is not present in a-Schema.

To decide between options (A) and (B), consider the worst case where all account holders have a negative balance.

It should be (A). In (B) we do a join between two massive tables, whereas in (A) we first join the relatively small branch table with the larger account table; the intermediate output (which is small compared to the joins done in (B)) is then joined with the depositor table under the selection condition.

Options C and D are invalid as there is no b-city column in a-Schema.

Let's see in detail. Let there be $100$ different branches, and say about $10\%$ of accounts have a balance below $0$. Also, let there be $10,000$ accounts per branch, amounting to $1,000,000$ total accounts. A customer can have multiple accounts, so let there be on average $2$ accounts per customer; this gives $2,000,000$ total entries in the depositor table. Assume these figures hold for all branches. Now let's evaluate options (A) and (B).

1. Query (A): all the accounts in the Agra branch, filtered by negative balance, and then the depositor details of them. So,

• Get the branch name from the branch table after processing $100$ records
• Filter out the $10,000$ Agra accounts after processing all $1,000,000$ accounts
• Filter $1,000$ accounts after checking those $10,000$ accounts for negative balance
• Get $500$ depositor details for the given $1,000$ accounts after processing the $2,000,000$ depositor entries once per account (assuming $1$ customer holds $2$ accounts), i.e. $1,000 \times 2,000,000 = 2,000,000,000$ record processings
• So in total $\approx 2$ billion records need processing.

2. Query (B): all the negative-balance accounts are found first, and only then those in Agra.

• Filter $100,000$ negative-balance accounts after processing all $1,000,000$ accounts
• Find the depositor details of these accounts: $100,000 \times 2,000,000$ record processings, a much larger value than for query (A). Even if we reduce the assumed percentage of negative-balance accounts ($10\%$), the record processing of query (A) is reduced at the same rate. So overall, query (A) is much better than query (B).
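The counts above can be sketched as a tiny cost model. The instance sizes ($100$ branches, $1{,}000{,}000$ accounts, $2{,}000{,}000$ depositor entries, $10\%$ negative balance) are the hypothetical assumptions of this answer, not facts from the question, and the join cost is modeled nested-loop style:

```python
# Rough cost model for the analysis above (hypothetical instance
# sizes from this answer, not from the question itself).
BRANCHES = 100
ACCOUNTS = 1_000_000          # 10,000 accounts per branch
DEPOSITORS = 2_000_000        # on average 2 accounts per customer
NEG_FRACTION = 0.10           # 10% of accounts have bal < 0

# Query A: select the Agra branch, join with account, filter bal < 0,
# then join the 1,000 survivors with depositor (nested-loop style).
agra_accounts = ACCOUNTS // BRANCHES            # 10,000
neg_agra = int(agra_accounts * NEG_FRACTION)    # 1,000
cost_a = BRANCHES + ACCOUNTS + agra_accounts + neg_agra * DEPOSITORS

# Query B: filter bal < 0 over all accounts first, then join all
# 100,000 survivors with the huge depositor relation.
neg_all = int(ACCOUNTS * NEG_FRACTION)          # 100,000
cost_b = ACCOUNTS + neg_all * DEPOSITORS

print(f"query A ~ {cost_a:,} records processed")   # ~2 billion
print(f"query B ~ {cost_b:,} records processed")   # ~200 billion
```

Scaling NEG_FRACTION down shrinks both dominant terms by the same factor, which is why query (A) stays ahead under this model.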
by Loyal (6.9k points)
edited
+3
@Marv, account is bigger in size.

1) The second query filters account with bal < 0 first, making it very small; in option (A), by comparison, the accounts are filtered only after the join.
0
The answer is option B and option A is wrong
0
In the first query you say "filter the 10,000 records belonging to branch Agra". But this step itself requires a join operation and will cost more.

By your numbers we have $1$ Agra branch, since all $100$ branches are distinct; after selecting that $1$ branch we still compare it against all $1,000,000$ accounts, which costs $1 \times 1,000,000$ operations, not $10,000$.
+6

Just to make it more clear.

We should always do select before join to avoid unnecessary tuples being considered for join.
(For an SQL query this is not strictly required as most DBMS will rearrange the query to make it efficient)

Credits : @GateAspirant999

0

@Arjun SIr-

In the statement

• Filter 1000 accounts after processing 10,000 accounts for positive balance(

I think we are scanning for negative balance?Please check

(B), as b (branch) is very small compared to a and d, which is also taken into account. Moreover, the condition $\sigma_{b\_city="Agra"}\,branch$ already filters the city to Agra, making that side small, and $\sigma_{bal<0}\,account \bowtie depositor$ is filtered before the outer join, so the $\bowtie$ between them gives the same result more cheaply.
Please correct me if not. Also, (3) and (4) are wrong, as $\sigma_{b\_city="Agra" \wedge bal<0}\,account$ selects on b-city from account, which is not an attribute of a-Schema.

by Junior (889 points)
+6
Yes. That is correct. We should always do select before join to avoid unnecessary tuples being considered for join.

(For an SQL query this is not strictly required as most DBMS will rearrange the query to make it efficient)
+1
Thanks  for verifying :)
+8

Sorry I missed this sentence "account and depositor relations are much bigger than the branch relation"

That makes A the best answer.

0
I also find this answer correct, as bal < 0 is applied as a selection on the bigger table first.
I guess the most efficient one is this one:

$\prod_{c\_name}(((\sigma_{b\_city="Agra"}branch)\bowtie(\sigma_{bal<0}account))\bowtie depositor)$

though it is not given in the options.

Am I right @Arjun sir?
by Active (2.5k points)
edited
+1
Actually it won't make much of a difference compared to query (A). Here we first filter the negative-balance accounts and then find those in Agra; in query (A) this is done in reverse. So it only matters if the number of negative-balance accounts is very small. You can see the accepted answer now.
0
I was thinking of the structurally most efficient query, without considering anything about what the current instance of the relation may contain.
+1
Not really. Your query first filters on "balance", whereas in option (A) the natural join first filters on the join attribute, the branch. So which one is better depends on the instance.
0

which one is better depends on the instance

Q. Does that mean that option (A) can be more efficient than the one I put above on some relation instances?

I think it will not be, and the one I put above will always be better than the others. I followed one principle that I believe is correct: "whenever there is a condition with a constant, apply it first, before performing joins, so that the join computation is smaller". In option (A), account is joined as it is, without first applying the condition (bal < 0), which is what I tried to do.
(Maybe we should not waste time if this is trivial, but I still think the one above will perform better than the options in all cases.)

0

No, it is not trivial. Intuitively that looks correct, but it may not always work. I had given an example in the accepted answer. Your query closely follows option (A), but let's analyze it further.

1. All the negative-balance accounts in the Agra branch, and then the depositor details of them. So,

• Get the branch name from the branch table after processing $100$ records
• Filter $100,000$ accounts with negative balance after processing $1,000,000$ accounts
• Filter $1,000$ accounts belonging to Agra after processing those $100,000$ accounts
• From here it follows query (A).

So in total we processed $100 + 1,000,000 + 100,000$ records.

In query (A) we process $100 + 1,000,000 + 10,000$ records.

The difference is negligible, still numerically Query A performed better. But this is just because of the assumptions I used for the instance.
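The comparison above reduces to simple arithmetic under the same hypothetical instance ($100$ branches, $1{,}000{,}000$ accounts, $10\%$ negative balance), which a short sketch makes explicit:

```python
# Record-processing counts for query A vs the suggested query,
# under the hypothetical instance used throughout this thread.
BRANCHES, ACCOUNTS, NEG_FRACTION = 100, 1_000_000, 0.10

# Query A: select the Agra branch (100 records), scan all accounts for
# the join (1,000,000), then filter the 10,000 Agra accounts by bal < 0.
cost_a = BRANCHES + ACCOUNTS + ACCOUNTS // BRANCHES

# Suggested query: select the Agra branch (100 records), scan all
# accounts for bal < 0 (1,000,000), then join the 100,000 survivors.
cost_suggested = BRANCHES + ACCOUNTS + int(ACCOUNTS * NEG_FRACTION)

print(cost_a, cost_suggested)  # both dominated by the full account scan
```

Both plans pay the full account scan, so the gap is only in the third term; which plan wins depends on how selective each filter is on the actual instance.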

0

I didn't get your explanation exactly, so I gave it some more time. I tried using your numbers wherever possible and came up with this. It still feels that option (A) will not be more efficient than the query I suggested in any situation.

0
In option (A), the selection b-city = "Agra" is done before the join is performed. Your parentheses are not correct.
+5

ohhhh right. Now I have made corrections and I can see what you were saying:

However, now I want to ask: will option (A) always perform better than my query?
I also want to generalize this. My query follows the approach of "apply all constant-based filters first, before performing any join", while the option (A) query follows the approach below:
(I tried to figure out the idea behind this approach. It may not be completely correct, so please correct me if I am wrong.)

• apply constant-based filters on the next smallest table, one by one, to make them even smaller
• each time after applying a constant-based filter on the next smallest table, join the result with the next biggest table (if such a join makes sense for obtaining the desired output tuples)
• after performing the join with the next biggest table, apply the constant-based filter on the result (if it applies there)

So does the option (A) approach above always give better performance than the approach of my query?

Initially I felt yes, that the option (A) approach will always perform better than my query's approach. But now I feel this completely depends on the relation instance and on the sizes of the outputs of the filters applied. Am I right about all that?

+4
Yes, as I said, query (A) won't always perform better than yours; that might be why your query is not among the choices. When IIT profs make questions, the answer is usually easy to pick, as there won't be much ambiguity. If your query were added to the options, we could not easily pick an answer.

Now, optimizing a query is not a simple task. So I don't think there exists an algorithm like you describe which works for all instances and gives the best query execution plan. Usually one has to see how a query performs on an instance and optimize it from there.
0
billion thanks sir for confirmation
ans A)
by Active (2.6k points)
by Junior (555 points)
Since the conditional expressions compare attributes with constants, they can be applied independently to filter out tuples before performing the join over the relations; this reduces the relations to smaller ones, and the join is easier on smaller relations.

The given query is inefficient because the join is performed before selecting tuples from the relations, so we have to pick the efficient query from the options.

Option (A) has the condition (bal < 0) applied after the join of branch and account, so it is inefficient.

Option (C) has the condition (b-city = 'Agra') on the account relation, but b-city is not in account. So it is invalid.

Option (D) is nearly the same as (C), so it is invalid too.

Option (B) is the most efficient one, as the filtering of tuples is performed before joining.

Therefore, option (B) is correct.
by Active (2.6k points)
+2
Account and depositor are much bigger than branch. (B) is very costly; (A) is the answer!