3
votes

I am using CDH-5.4.4 Cloudera Edition, I have a CSV file in HDFS location, My requirement is to perform Real time SQL queries on Hadoop Environement (OLTP).

So I decided to go with Impala, I have created MetaStore table to a CSV file, then execuing query in impala editor (within HUE application) .

When i am executing below query, i am getting error like

"AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT City); deviating function: count(DISTINCT Country)".

CSV File

OrderID,CustomerID,City,Country
Ord01,Cust01,Aachen,Germany
Ord02,Cust01,Albuquerque,USA
Ord03,Cust01,Aachen,Germany
Ord04,Cust02,Arhus,Denmark
Ord05,Cust02,Arhus,Denmark

Problamatic Query

Select CustomerID,Count(Distinct City),Count(Distinct Country) From CustomerOrders Group by CustomerID

Problem:

Unable to execute the Impala Query with More than one Distinct Values in an Query.. I have searched over internet they provide NDV() method as a workaround, But NDV method only returns approximate count of distinct values, I need Exact unique count for more than one fields.

Expectation:

What is the best way to do Exact unique count for more than one fields? Kindly modify the above query to work with Impala.

Note: This is not my original table, I have replicate for the forum question.

2

2 Answers

2
votes

I've the same problem in Impala. Here is my workaround:

SELECT CustomerID
    ,sum(nr_of_cities)
    ,sum(nr_of_countries)
FROM (
    SELECT CustomerID
        ,Count(DISTINCT City) AS nr_of_cities
        ,0 AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID

    UNION ALL

    SELECT CustomerID
        ,0 AS nr_of_cities
        ,Count(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
) AS aa
GROUP BY CustomerID
1
votes

I think this can be done cleaner (untested):

WITH
countries AS
(
 SELECT CustomerID
       ,COUNT(DISTINCT City) AS nr_of_countries
 FROM CustomerOrders
 GROUP BY 1
)
,
cities AS
(
 SELECT CustomerID
       ,COUNT(DISTINCT City) AS nr_of_cities
 FROM CustomerOrders
 GROUP BY 1
)
SELECT CustomerID
      ,nr_of_cities
      ,nr_of_countries
 FROM cities INNER JOIN countries USING (CustomerID)