0
votes

HDP-2.5.0.0 using Ambari 2.4.0.1

The Hive table ReportSetting is as follows :

id int

serializedreportsetting String

The column 'serializedreportsetting' is an XML data type in the source SQL Server db but is converted to String during Sqoop import, this is how it looks in SQL Server :

<ReportSettings4 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Trigger>
  <Manual>true</Manual>
  </Trigger>
<StartTime>
    <Year>8</Year>
    <Month>1</Month>
    <Day>1</Day>
    <Hour>0</Hour>
    <Minute>0</Minute>
  </StartTime>
  <ReportPeriod>
    <Month>0</Month>
    <Day>0</Day>
    <Hour>0</Hour>
    <Minute>5</Minute>
  </ReportPeriod>
  <Theft>
    <DigitalInput>true</DigitalInput>
    <Can>false</Can>
  </Theft>
  <SequenceNo>0</SequenceNo>
</ReportSettings4>

In Hive table :

<ReportSettings4 xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Trigger><Manual>true</Manual></Trigger><StartTime><Year>8</Year><Month>12</Month><Day>31</Day><Hour>23</Hour><Minute>34</Minute></StartTime><ReportPeriod><Month>0</Month><Day>0</Day><Hour>4</Hour><Minute>0</Minute></ReportPeriod><Theft><DigitalInput>false</DigitalInput><Can>false</Can></Theft><SequenceNo>3</SequenceNo></ReportSettings4>

The query that works fine on the SQL Server :

SELECT
r.VehicleId
,rs.value('(Trigger/Manual)[1]', 'bit') AS RS_Trigger_Manual, ,CAST(CONCAT(CASE WHEN rs.value('(StartTime/Year)[1]', 'int') < 10 THEN CONCAT('200',rs.value('(StartTime/Year)[1]', 'int')) ELSE CONCAT('20',rs.value('(StartTime/Year)[1]', 'int')) END,'-',rs.value('(StartTime/Month)[1]', 'int'),'-',rs.value('(StartTime/Day)[1]', 'int'),' ',rs.value('(StartTime/Hour)[1]', 'int'),':',rs.value('(StartTime/Minute)[1]', 'int'),':','00.000') AS datetime) AS RS_StartTime
,rs.value('(ReportPeriod/Month)[1]', 'int') AS RS_ReportPeriod_Month
,rs.value('(ReportPeriod/Day)[1]', 'int') AS RS_ReportPeriod_Day
,rs.value('(ReportPeriod/Hour)[1]', 'int') AS RS_ReportPeriod_Hour
,rs.value('(ReportPeriod/Minute)[1]', 'int') AS RS_ReportPeriod_Minute
,rs.value('(Theft/DigitalInput)[1]', 'bit') AS RS_Theft_DigitalInput
,rs.value('(Theft/Can)[1]', 'bit') AS RS_Theft_Can,rs.value('(SequenceNo)[1]', 'int') 

AS RS_SequenceNo FROM ReportSetting r
  CROSS APPLY SerializedReportSetting.nodes('/*') AS ReportSettings(rs)

I could think of/do the following :

  1. To use the CROSS APPLY, I guess lateral view needs to be used, here I don't have the serializedreportsetting as an array, so explode() won't work. Can someone validate if I'm thinking in the right direction
  2. I simply tried to get the data in serializedreportsetting as columns using the built-in xpath udf, however, I don't get any records, few trials are as follows :

    select xpath(SerializedReportSetting,'/*') from ReportSetting limit 1;

    select xpath(SerializedReportSetting,'/ReportSettings4') from ReportSetting limit 1;

    select xpath(SerializedReportSetting,'/Trigger/Manual') from ReportSetting limit 1;

**********UPDATE-1**********

I used the regexp_replace to handle the above challenge :

SELECT id,
  xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/Manual')        AS RS_Trigger_Manual,
  xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/DriveChange')   AS RS_Trigger_DriveChange
FROM
  (SELECT id,
    regexp_replace(SerializedReportSetting, 'ReportSettings+\\d','ReportSettings') AS SerializedReportSetting
  FROM reportsetting
  WHERE id IN (1701548,3185,1700231,1700232)
  ) reportsetting_regex;
1

1 Answers

1
votes

In xpath they explicitly say:

The xpath() function always returns a hive array of strings. If the expression results in a non-text value (e.g., another xml node) the function will return an empty array

So you can either use: xpath(SerializedReportSetting,'/ReportSettings4/Trigg‌​er/Manual/text()') from ReportSetting limit 1;

Or an even better option is to use xpath_boolean/xpath_int:

xpath_boolean - Returns true if the XPath expression evaluates to true, or if a matching node is found.

xpath_boolean(SerializedReportSetting,'/ReportSettings4/Trigg‌​er/Manual') from ReportSetting limit 1;

xpath_short, xpath_int, xpath_long These functions return an integer numeric value, or the value zero if no match is found, or a match is found but the value is non-numeric. Mathematical operations are supported. In cases where the value overflows the return type, then the maximum value for the type is returned.

xpath_int(SerializedReportSetting,'/ReportSettings4/ReportPeriod/Month') from ReportSetting limit 1;