Hive Interview Questions
Contents
Explain HIVE architecture
Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semi-structured and unstructured data can be queried gracefully via Hive, due to two core features. The first is Hive’s support of complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.
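As a quick illustration of both features, a table might combine complex types with an explicit SerDe. This is only a sketch: the table and columns are hypothetical, and LazySimpleSerDe is Hive's default text SerDe.
-- Hypothetical table mixing primitive and complex types,
-- declared with an explicit SerDe for the text layout
CREATE TABLE web_events (
  user_id    BIGINT,
  tags       ARRAY<STRING>,
  profile    STRUCT<city:STRING, country:STRING>,
  properties MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE;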
What is a map side join?
Join is a clause that combines the records of two tables (or Data-Sets).
Assume that we have two tables A and B. When we perform a join operation on them, it will return the records which are the combination of all columns of A and B.
Now let us understand the functionality of a normal join with an example.
Whenever we apply a join operation, the job is assigned to a MapReduce task which consists of two stages: a Map stage and a Reduce stage. A mapper’s job during the Map stage is to read the data from the join tables and to return the ‘join key’ and ‘join value’ pair into an intermediate file. Further, in the shuffle stage, this intermediate file is sorted and merged. The reducer’s job during the Reduce stage is to take this sorted result as input and complete the task of the join.
- Map-side join is similar to a normal join, but all the work is performed by the mapper alone.
- Map-side join is mostly suitable when one of the tables is small, which lets Hive optimize the task.
How will the map-side join optimize the task?
Assume that we have two tables of which one of them is a small table. When we submit a map reduce task, a Map Reduce local task will be created before the original join Map Reduce task which will read data of the small table from HDFS and store it into an in-memory hash table. After reading, it serializes the in-memory hash table into a hash table file.
In the next stage, when the original join Map Reduce task is running, it moves the data in the hash table file to the Hadoop distributed cache, which populates these files to each mapper’s local disk. So all the mappers can load this persistent hash table file back into the memory and do the join work as before. The execution flow of the optimized map join is shown in the figure below. After optimization, the small table needs to be read just once. Also if multiple mappers are running on the same machine, the distributed cache only needs to push one copy of the hash table file to this machine.
Advantages of using Map-side join:
- Map-side join helps in minimizing the cost that is incurred for sorting and merging in the shuffle and reduce stages.
- Map-side join also helps in improving the performance of the task by decreasing the time to finish the task.
Disadvantages of Map-side join:
- Map-side join is adequate only when one of the tables on which you perform the map-side join operation is small enough to fit into memory. Hence it is not suitable when both tables contain huge amounts of data.
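For reference, a hedged sketch of how a map join is usually requested; hive.auto.convert.join is the relevant Hive property, and the table and column names below are hypothetical.
-- Let Hive convert the join automatically when one side is small enough
SET hive.auto.convert.join=true;

-- Or request it explicitly with a hint naming the small table
SELECT /*+ MAPJOIN(b) */ a.key, a.value, b.description
FROM big_table a
JOIN small_table b ON a.key = b.key;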
Explain Schema on Read Versus Schema on Write?
In a traditional database, a table’s schema is enforced at data load time. If the data being
loaded doesn’t conform to the schema, then it is rejected. This design is sometimes called
schema on write because the data is checked against the schema when it is written into the
database.
Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a query
is issued. This is called schema on read.
There are trade-offs between the two approaches. Schema on read makes for a very fast
initial load, since the data does not have to be read, parsed, and serialized to disk in the
database’s internal format. The load operation is just a file copy or move. It is more
flexible, too: consider having two schemas for the same underlying data, depending on the
analysis being performed. (This is possible in Hive using external tables; see Managed
Tables and External Tables.)
Schema on write makes query time performance faster because the database can index
columns and perform compression on the data. The trade-off, however, is that it takes
longer to load data into the database. Furthermore, there are many scenarios where the
schema is not known at load time, so there are no indexes to apply, because the queries
have not been formulated yet. These scenarios are where Hive shines.
Difference between "Sort By" and "Group by" in Hive. How they work?
GROUP BY groups rows on the given columns so that aggregate functions (COUNT, MAX, and so on) can be computed per group in the reducers.
SORT BY only sorts the output of each reducer, giving a partial sort. For a global sort across all reducers, use the ORDER BY clause, which forces everything through a single reducer.
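A minimal sketch of the three clauses, using the records2 weather table that appears later in this document:
-- ORDER BY: a single reducer produces one totally ordered output (global sort)
SELECT year, temperature FROM records2 ORDER BY temperature DESC;

-- SORT BY: each reducer sorts only its own output (partial sort)
SELECT year, temperature FROM records2 SORT BY temperature DESC;

-- GROUP BY: groups rows so aggregates are computed per group
SELECT year, MAX(temperature) FROM records2 GROUP BY year;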
Difference between Hive internal tables and external tables?
Managed Tables and External Tables
When you create a table in Hive, by default Hive will manage the data, which means that
Hive moves the data into its warehouse directory. Alternatively, you may create an
external table, which tells Hive to refer to the data that is at an existing location outside
the warehouse directory.
The difference between the two table types is seen in the LOAD and DROP semantics. Let’s
consider a managed table first.
When you load data into a managed table, it is moved into Hive’s warehouse directory.
For example, this:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for the
managed_table table, which is hdfs://user/hive/warehouse/managed_table.
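By contrast, an external table leaves the data where it is; a sketch continuing the same example (the location is illustrative):
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

-- Dropping a managed table deletes its data and metadata;
-- dropping an external table deletes only the metadata and leaves the files.
DROP TABLE external_table;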
Explain various data types in HIVE
Explain type of index in HIVE?
There are currently two index types:
compact and bitmap. (The index implementation was designed to be pluggable, so it’s
expected that a variety of implementations will emerge for different use cases.)
Compact indexes store the HDFS block numbers of each value, rather than each file offset,
so they don’t take up much disk space but are still effective for the case where values are
clustered together in nearby rows. Bitmap indexes use compressed bitsets to efficiently
store the rows that a particular value appears in, and they are usually appropriate for low cardinality
columns (such as gender or country).
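A hedged sketch of the index DDL (indexes were removed in Hive 3.0, and the table and column names here are placeholders):
CREATE INDEX users_country_idx
ON TABLE users (country)
AS 'BITMAP'           -- or 'COMPACT' for the compact implementation
WITH DEFERRED REBUILD;

-- Build the index data
ALTER INDEX users_country_idx ON users REBUILD;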
What is the main difference between Hive and SQL
How to optimize Hive Performance?
Technique #1: Use Tez
Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. I won’t go into details about the many benefits of using Tez; instead, I want to make a simple recommendation: if it’s not turned on by default in your environment, enable Tez by setting the following at the beginning of your Hive query:
[code language=”SQL”]
set hive.execution.engine=tez;
[/code]
With the above setting, every HIVE query you execute will take advantage of Tez.
Technique #2: Use ORCFile
Hive supports ORCfile, a new table storage format that sports fantastic speed improvements through techniques like predicate push-down, compression and more.
Using ORCFile for every HIVE table should really be a no-brainer and extremely beneficial to get fast response times for your HIVE queries.
As an example, consider two large tables A and B (stored as text files, with some columns not all specified here), and a simple query like:
[code language=”SQL”]
SELECT A.customerID, A.name, A.age, A.address,
       B.role, B.department, B.salary
FROM A JOIN B
ON A.customerID=B.customerID;
[/code]
This query may take a long time to execute since tables A and B are both stored as TEXT. Converting these tables to ORCFile format will usually reduce query time significantly:
[code language=”SQL”]
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE A_ORC SELECT * FROM A;
CREATE TABLE B_ORC (
customerID int, role string, salary float, department string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE B_ORC SELECT * FROM B;
SELECT A_ORC.customerID, A_ORC.name,
       A_ORC.age, A_ORC.address,
       B_ORC.role, B_ORC.department, B_ORC.salary
FROM A_ORC JOIN B_ORC
ON A_ORC.customerID=B_ORC.customerID;
[/code]
ORC supports compressed storage (with ZLIB or as shown above with SNAPPY) but also uncompressed storage.
Converting base tables to ORC is often the responsibility of your ingest team, and it may take them some time to change the complete ingestion process due to other priorities. The benefits of ORCFile are so tangible that I often recommend a do-it-yourself approach as demonstrated above – convert A into A_ORC and B into B_ORC and do the join that way, so that you benefit from faster queries immediately, with no dependencies on other teams.
Technique #3: Use Vectorization
Vectorized query execution improves performance of operations like scans, aggregations, filters and joins, by performing them in batches of 1024 rows at once instead of single row each time.
Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameters settings:
[code language=”SQL”]
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
[/code]
Technique #4: cost based query optimization
Hive optimizes each query’s logical and physical execution plan before submitting for final execution. These optimizations are not based on the cost of the query – that is, until now.
A recent addition to Hive, Cost-based optimization, performs further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, degree of parallelism and others.
To use cost-based optimization (also known as CBO), set the following parameters at the beginning of your query:
[code language=”SQL”]
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
[/code]
Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO.
For example, in a table tweets we want to collect statistics about the table and about 2 columns: “sender” and “topic”:
[code language=”SQL”]
analyze table tweets compute statistics;
analyze table tweets compute statistics for columns sender, topic;
[/code]
With HIVE 0.14 (on HDP 2.2) the analyze command works much faster, and you don’t need to specify each column, so you can just issue:
[code language=”SQL”]
analyze table tweets compute statistics for columns;
[/code]
That’s it. Now executing a query using this table should result in a different execution plan that is faster because of the cost calculation and different execution plan created by Hive.
Explain the joins different from RDBMS
Semi joins
Consider this IN subquery, which finds all the items in the things table that are in the
sales table:
SELECT *
FROM things
WHERE things.id IN (SELECT id from sales);
We can also express it as follows:
hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2 Tie
4 Coat
3 Hat
There is a restriction that we must observe for LEFT SEMI JOIN queries: the right table
(sales) may appear only in the ON clause. It cannot be referenced in a SELECT expression,
for example.
Map joins
Consider the original inner join again:
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
If one table is small enough to fit in memory, as things is here, Hive can load it into
memory to perform the join in each of the mappers. This is called a map join.
The job to execute this query has no reducers, so this query would not work for a RIGHT or
FULL OUTER JOIN, since absence of matching can be detected only in an aggregating
(reduce) step across all the inputs.
Map joins can take advantage of bucketed tables (see Buckets), since a mapper working
on a bucket of the left table needs to load only the corresponding buckets of the right table
to perform the join. The syntax for the join is the same as for the in-memory case shown
earlier; however, you also need to enable the optimization with the following:
SET hive.optimize.bucketmapjoin=true;
Can we write map reduce scripts in HIVE?
Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses make
it possible to invoke an external script or program from Hive.
Yes, we can write MapReduce scripts for Hive, for example in Python or Java. Here is a Python example:
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
  (year, temp, q) = line.strip().split()
  if (temp != "9999" and re.match("[01459]", q)):
    print "%s\t%s" % (year, temp)
We can use the script as follows:
hive> ADD FILE /Users/tom/book-workspace/hadoop-book/ch17-hive/
src/main/python/is_good_quality.py;
hive> FROM records2
> SELECT TRANSFORM(year, temperature, quality)
> USING 'is_good_quality.py'
> AS year, temperature;
1950 0
1950 22
1950 -11
1949 111
1949 78
Before running the query, we need to register the script with Hive. This is so Hive knows
to ship the file to the Hadoop cluster (see Distributed Cache).
The query itself streams the year, temperature, and quality fields as a tab-separated line
to the is_good_quality.py script, and parses the tab-separated output into year and
temperature fields to form the output of the query.
This example has no reducers. If we use a nested form for the query, we can specify a map
and a reduce function. This time we use the MAP and REDUCE keywords, but SELECT
TRANSFORM in both cases would have the same result. (Example 2-10 includes the source
for the max_temperature_reduce.py script):
FROM (
FROM records2
MAP year, temperature, quality
USING 'is_good_quality.py'
AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
What types of user-defined functions exist in Hive?
There are three kinds: UDFs, UDAFs (user-defined aggregate functions), and UDTFs (user-defined table-generating functions). A UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical functions and string functions, are of this type.
A UDF must satisfy the following two properties:
A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
A UDF must implement at least one evaluate() method.
A UDAF works on multiple input rows and creates a single output row. Aggregate functions include such functions as COUNT and MAX.
A UDAF must satisfy the following two properties:
A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF;
An evaluator must implement five methods:
init()
iterate()
terminatePartial()
merge()
terminate()
A UDTF operates on a single row and produces multiple rows — a table — as output.
A UDTF must be a subclass of org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
A custom UDTF can be created by extending the GenericUDTF abstract class and then implementing the
initialize,
process,
and possibly close methods.
The initialize method is called by Hive to notify the UDTF the argument types to expect.
The UDTF must then return an object inspector corresponding to the row objects that the UDTF will generate.
Once initialize() has been called, Hive will give rows to the UDTF using the process() method.
While in process(), the UDTF can produce and forward rows to other operators by calling forward().
Lastly, Hive will call the close() method when all the rows have passed to the UDTF.
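Once such a class is compiled into a jar, it is typically wired into a session as sketched below; the jar path, class name, and tables are hypothetical, while explode is Hive's built-in UDTF.
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.example.hive.udf.Strip';

SELECT strip(name) FROM users;

-- A UDTF such as explode is usually invoked through LATERAL VIEW
SELECT t.word
FROM docs LATERAL VIEW explode(split(line, ' ')) t AS word;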
How HIVE views are different from RDBMS views?
How do you check the performance of a HIVE query?
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
Assume there is an array column in a table; how will you show the data in row format?
hive> SELECT explode(x) AS y FROM arrays;
a
b
c
d
e
Explain partition and bucket concept in HIVE?
Partitions and Buckets
Hive organizes tables into partitions — a way of dividing a table into coarse-grained parts
based on the value of a partition column, such as a date. Using partitions can make it
faster to do queries on slices of the data.
Tables or partitions may be subdivided further into buckets to give extra structure to the
data that may be used for more efficient queries. For example, bucketing by user ID means
we can quickly evaluate a user-based query by running it on a randomized sample of the
total set of users.
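A brief sketch of the partition DDL and how it is used (the table, columns, and paths are illustrative):
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

-- Filtering on the partition column lets Hive prune to the matching directories
SELECT ts, line FROM logs WHERE dt='2001-01-01';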
Buckets
There are two reasons why you might want to organize your tables (or partitions) into
buckets. The first is to enable more efficient queries. Bucketing imposes extra structure on
the table, which Hive can take advantage of when performing certain queries. In
particular, a join of two tables that are bucketed on the same columns — which include the
join columns — can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working
with large datasets, it is very convenient to try out queries on a fraction of your dataset
while you are in the process of developing or refining them. We will see how to do efficient sampling at the end of this section.
First, let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY
clause to specify the columns to bucket on and the number of buckets:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket (which Hive does by hashing the
value and reducing modulo the number of buckets), so any particular bucket will
effectively have a random set of users in it.
In the map-side join case, where the two tables are bucketed in the same way, a mapper
processing a bucket of the left table knows that the matching rows in the right table are in
its corresponding bucket, so it need only retrieve that bucket (which is a small fraction of
all the data stored in the right table) to effect the join. This optimization also works when
the number of buckets in the two tables are multiples of each other; they do not have to
have exactly the same number of buckets. The HiveQL for joining two bucketed tables is
shown in Map joins.
The data within a bucket may additionally be sorted by one or more columns. This allows
even more efficient map-side joins, since the join of each bucket becomes an efficient
merge sort. The syntax for declaring that a table has sorted buckets is:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
How can we make sure the data in our table is bucketed? Although it’s possible to load
data generated outside Hive into a bucketed table, it’s often easier to get Hive to do the
bucketing, usually from an existing table.
WARNING
Hive does not check that the buckets in the datafiles on disk are consistent with the buckets in the table definition
(either in number or on the basis of bucketing columns). If there is a mismatch, you may get an error or undefined
behavior at query time. For this reason, it is advisable to get Hive to perform the bucketing.
Take an unbucketed users table:
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann
To populate the bucketed table, we need to set the hive.enforce.bucketing property to
true so that Hive knows to create the number of buckets declared in the table definition.
Then it is just a matter of using the INSERT command:
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
Physically, each bucket is just a file in the table (or partition) directory. The filename is not
important, but bucket n is the nth file when arranged in lexicographic order. In fact,
buckets correspond to MapReduce output file partitions: a job will produce as many
buckets (output files) as reduce tasks. We can see this by looking at the layout of the
bucketed_users table we just created. Running this command:
hive> dfs -ls /user/hive/warehouse/bucketed_users;
shows that four files were created, with the following names (the names are generated by
Hive):
000000_0
000001_0
000002_0
000003_0
The first bucket contains the users with IDs 0 and 4, since for an INT the hash is the
integer itself, and the value is reduced modulo the number of buckets — four, in this case:
hive> dfs -cat /user/hive/warehouse/bucketed_users/000000_0;
0Nat
4Ann
We can see the same thing by sampling the table using the TABLESAMPLE clause, which
restricts the query to a fraction of the buckets in the table rather than the whole table:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
4 Ann
0 Nat
Bucket numbering is 1-based, so this query retrieves all the users from the first of four
buckets. For a large, evenly distributed dataset, approximately one-quarter of the table’s
rows would be returned. It’s possible to sample a number of buckets by specifying a
different proportion (which need not be an exact multiple of the number of buckets, as
sampling is not intended to be a precise operation). For example, this query returns half of
the buckets:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
4 Ann
0 Nat
2 Joe
Sampling a bucketed table is very efficient because the query only has to read the buckets
that match the TABLESAMPLE clause. Contrast this with sampling a nonbucketed table using
the rand() function, where the whole input dataset is scanned, even if only a very small
sample is needed:
hive> SELECT * FROM users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
2 Joe
Explain multiple inserts in HIVE table.
Multitable insert
In HiveQL, you can turn the INSERT statement around and start with the FROM clause for the same effect:
FROM source
INSERT OVERWRITE TABLE target
SELECT col1, col2;
The reason for this syntax becomes clear when you see that it’s possible to have multiple
INSERT clauses in the same query. This so-called multitable insert is more efficient than
multiple INSERT statements because the source table needs to be scanned only once to
produce the multiple disjoint outputs.
Here’s an example that computes various statistics over the weather dataset:
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
GROUP BY year;
There is a single source table (records2), but three tables to hold the results from three
different queries over the source.
Can you create data structure (columns and data type) of a new table which is similar to existing table?
CREATE TABLE new_table LIKE existing_table;
What is the difference between Drop table verses TRUNCATE?
Dropping Tables
The DROP TABLE statement deletes the data and metadata for a table. In the case of external
tables, only the metadata is deleted; the data is left untouched.
If you want to delete all the data in a table but keep the table definition, use TRUNCATE
TABLE. For example:
TRUNCATE TABLE my_table;
This doesn’t work for external tables; instead, use dfs -rmr (from the Hive shell) to
remove the external table directory directly.
Is LIMIT clause in HIVE really random?
Even though the documentation states that it returns rows at random, that is not actually true. It returns rows as they happen to appear in the table when there is no WHERE or ORDER BY clause. This means it is not really random (or randomly chosen) as you might think; it is just that the order in which the rows are returned cannot be determined.
As soon as you slap an ORDER BY x DESC LIMIT 5 on there, it returns the last 5 rows of whatever you're selecting from. To get rows returned at random, you would need to use something like: ORDER BY rand() LIMIT 1.
However, this can have a speed impact if your indexes aren't set up properly. Usually I do a min/max to get the IDs in the table, generate a random number between them, and then select those records (in this case, just one record), which tends to be faster than having the database do the work, especially on a large dataset.
Is Hive suitable to be used for OLTP systems? Why?
No. Hive does not provide row-level insert and update, so it is not suitable for OLTP systems.
Can a table be renamed in Hive?
Alter Table table_name RENAME TO new_name
Can we change the data type of a column in a hive table?
Using the REPLACE COLUMNS option:
ALTER TABLE table_name REPLACE COLUMNS ......
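For example (the table and column names are hypothetical; these statements rewrite only the table metadata, not the underlying files, so the stored data must already be compatible with the new type):
-- Change the type of a single column
ALTER TABLE t1 CHANGE COLUMN c1 c1 BIGINT;

-- Or restate the entire column list
ALTER TABLE t1 REPLACE COLUMNS (c1 BIGINT, c2 STRING);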
What is a metastore in Hive?
It is a relational database storing the metadata of hive tables, partitions, Hive
databases etc
What is the need for custom Serde?
Depending on the nature of the data a user has, the inbuilt SerDes may not satisfy the format of the data. So users need to write their own Java code to satisfy their data format requirements.
Why do we need Hive?
Hive is a tool in Hadoop ecosystem which provides an interface to organize and query data in a database like fashion and write SQL like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
What is the default location where hive stores table data?
hdfs://namenode_server/user/hive/warehouse
What are the three different modes in which hive can be run?
- Local mode
- Distributed mode
- Pseudodistributed mode
List few commonly used Hive services?
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hiveserver)
- Printing the contents of an RC file using the tool rcfilecat.
- Jar
- Metastore
Is there a date data type in Hive?
Yes. The TIMESTAMP data type stores dates in the java.sql.Timestamp format.
What are collection data types in Hive?
There are three collection data types in Hive.
ARRAY
MAP
STRUCT
What is a Hive variable? What for we use it?
A Hive variable is a variable created in the Hive environment that can be referenced by Hive
scripts. It is used to pass values to Hive queries when the query starts executing.
Can hive queries be executed from script files? How?
Using the source command.
Example − Hive> source /path/to/file/file_with_query.hql
What is the importance of .hiverc file?
It is a file containing a list of commands that need to run when the Hive CLI starts, for example setting strict mode to true.
What are the default record and field delimiter used for hive text files?
The default record delimiter is − \n
And the field delimiters are −
\001,\002,\003
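Spelled out explicitly in DDL, those defaults look roughly like this (the table and columns are illustrative):
CREATE TABLE default_delims (s STRING, a ARRAY<STRING>, m MAP<STRING, STRING>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n';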
What do you mean by schema on read?
The schema is validated with the data when reading the data and not enforced when writing data.
How do you list all databases whose name starts with p?
SHOW DATABASES LIKE ‘p.*’
What does the “USE” command in hive do?
With the use command you fix the database on which all the subsequent hive queries will run.
How can you delete the DBPROPERTY in Hive?
There is no way you can delete the DBPROPERTY.
What is the significance of the line set hive.mapred.mode=strict;
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause on the partition column. This prevents very large jobs from running for a long time.
How do you check if a particular partition exists?
This can be done with following query
SHOW PARTITIONS table_name PARTITION
(partitioned_column='partition_value');
Which java class handles the Input record encoding into files which store the tables in Hive?
org.apache.hadoop.mapred.TextInputFormat
Which java class handles the output record encoding into files which result from Hive queries?
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
What is the significance of the IF EXISTS clause while dropping a table?
When we issue the command DROP TABLE IF EXISTS table_name,
Hive does not throw an error if the table being dropped does not exist in the first place.
When you point a partition of a hive table to a new directory, what happens to the data?
The data stays in the old location. It has to be moved manually.
Write a query to insert a new column? Can you add a column with a default value in Hive?
ALTER TABLE test1 ADD COLUMNS (access_count1 int); You cannot add a column with a default value in Hive. No changes to the files backing your table will happen as a result of adding the column. Hive handles the "missing" data by interpreting NULL as the value for every cell in that column.
In Hive you essentially need to rewrite the whole table, this time with the column populated. It may be easier to rerun your original query with the new column. Or you could add the column to the table you have now, then select all of its columns plus a value for the new column, as sketched below.
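A minimal sketch of that rewrite; the original columns col1 and col2 and the default value 0 are illustrative, not from the source.
ALTER TABLE test1 ADD COLUMNS (access_count1 INT);

-- Rewrite the table so the new column carries a real value instead of NULL
INSERT OVERWRITE TABLE test1
SELECT col1, col2, 0 AS access_count1 FROM test1;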
Since there's no IFNULL, ISNULL, or NVL function supported on Hive, how to convert NULL to 0?
Use COALESCE(column, CAST(0 AS BIGINT)), which returns the first non-NULL argument. More recent Hive versions also provide NVL(value, default_value), which returns the default value if value is null and value otherwise.
Does the archiving of Hive tables give any space saving in HDFS?
No. It only reduces the number of files which becomes easier for namenode to manage.
How can you stop a partition form being queried?
By using the ENABLE OFFLINE clause with ALTER TABLE statement.
While loading data into a hive table using the LOAD DATA clause, how do you specify it is a hdfs file and not a local file ?
By omitting the LOCAL clause in the LOAD DATA statement.
If you omit the OVERWRITE clause while loading data into a hive table, what happens to files which are new and files which already exist?
The new incoming files are simply added to the target directory, and any existing file whose name matches an incoming file is
overwritten. Other files whose names do not match any of the incoming files will continue to exist.
If you add the OVERWRITE clause then all the existing data in the directory will be deleted before new data is written.
What does the following query do? INSERT OVERWRITE TABLE employees PARTITION (country, state) SELECT .., FROM staged_employees;
It creates partitions on the table employees, with partition values coming from the columns in the select clause. This is called a dynamic partition insert.
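For that insert to run as a dynamic partition insert, Hive typically also needs the settings below; the SELECT column list is illustrative, with the partition columns (country, state) coming last.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition values are taken from the trailing SELECT columns (country, state)
INSERT OVERWRITE TABLE employees PARTITION (country, state)
SELECT se.name, se.salary, se.country, se.state
FROM staged_employees se;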
What is a Table generating Function on hive?
A table-generating function is a function which takes a single column as an argument and expands it into multiple columns or rows, for example explode.
How can Hive avoid mapreduce?
If we set the property hive.exec.mode.local.auto to true, Hive will try to run small queries in local mode instead of submitting MapReduce jobs to the cluster to fetch query results.
What is the difference between LIKE and RLIKE operators in Hive?
The LIKE operator behaves the same way as the regular SQL operators used in select queries.
Example −
street_name like ‘%Chi’
But the RLIKE operator uses more advanced regular expressions, as available in Java.
Example − street_name RLIKE '.*(Chi|Oho).*', which will select any name that has either Chi or Oho in it.
Is it possible to create Cartesian join between 2 tables, using Hive?
No. This kind of join cannot be implemented efficiently in MapReduce, so Hive restricts it as part of query optimization.
What should be the order of table size in a join query?
In a join query the smallest table to be taken in the first position and largest table should be taken in the last position.
What is the usefulness of the DISTRIBUTE BY clause in Hive?
DISTRIBUTE BY controls how map output is divided among reducers. By default, MapReduce computes a hash on the keys output by mappers and tries to evenly distribute the key-value pairs among the available reducers using the hash values. Say we want the data for each value in a column to be captured together; we can use DISTRIBUTE BY to ensure that the records for each value go to the same reducer. DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers receive rows for processing. Note that Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause if both appear in the same query. DISTRIBUTE BY is also a good workaround to use less memory in a memory-intensive job, since it forces Hadoop to use reducers instead of a map-only job: the mappers group rows based on the DISTRIBUTE BY columns, which means less work for the framework overall, and pass these groups on to the reducers.
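A short sketch using the records2 weather table that appears elsewhere in this document: each year's rows go to one reducer, which then sorts them.
FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;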
What is the main difference between dynamic and static partitioning in hive?
Partitioning in Hive is very useful to prune data during query to reduce query times.
Partitions are created when data is inserted into the table. Depending on how you load data, you may need static or dynamic partitions. Usually, when loading big files into Hive tables, static partitions are preferred. That saves time in loading data compared to dynamic partitioning: you "statically" add a partition to the table and move the file into the partition directory. Since the files are big, they are usually generated in HDFS, and you can get the partition column value from the filename, the day of the date, and so on, without reading the whole big file.
In case of dynamic partitioning, the whole big file is read, i.e. every row of the data, and the data is partitioned through an MR job into the destination tables depending on certain fields in the file. So dynamic partitions are usually useful when you are doing a sort of ETL flow in your data pipeline. For example, you load a huge file through a move command into a table X, then you run an insert query into a table Y and partition the data based on fields in table X, say day and country. You may then want to run a further ETL step to partition the data in a country partition of table Y into a table Z, where data is partitioned based on cities for that particular country only.
Thus depending on your end table or requirements for data and in what form data is produced at source you may choose static or dynamic partition.
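For contrast, a hedged sketch of a static-partition load (the table, partition column, and paths are hypothetical); the partition value is supplied in the statement, so nothing in the file has to be read:
LOAD DATA INPATH '/staging/page_views/2015-01-01.gz'
INTO TABLE page_views PARTITION (day='2015-01-01');

-- Or, if the data already sits in its final directory, just register it:
ALTER TABLE page_views ADD PARTITION (day='2015-01-01')
LOCATION '/data/page_views/day=2015-01-01';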
How do you convert string to int or bigint in hive?
SELECT CAST('00321' AS BIGINT) FROM table;
SELECT CAST('00321' AS INT) FROM table;
Can the name of a view be same as the name of a hive table?
No. The name of a view must be unique when compared to all other tables and views present in the same database.
Can we LOAD data into a view?
No. A view can not be the target of a INSERT or LOAD statement.
What types of costs are associated in creating index on hive tables?
Indexes occupy space, and there is a processing cost in arranging the values of the column on which the index is created.
Give the command to see the indexes on a table.
SHOW INDEX ON table_name
This will list all the indexes created on any of the columns in the table table_name.
What is Hadoop Buffering vs Streaming?
Hadoop Streaming in general refers to using custom-made Python or shell scripts to perform your map-reduce logic (for example, using the Hive TRANSFORM keyword).
Hadoop buffering, in this context, refers to the phase in a map-reduce job of a Hive query with a join, when records are read into the reducers after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables are last, because it helps optimize the implementation of joins in Hive.
What is the importance of STREAM TABLE?
Hive Join Optimization: Stream table in Joins
Joins play an important role when you need to get information from multiple tables, but when you have 1.5 billion+ records in one table and are joining it with a master table, the order of the joined tables becomes very important.
Let's consider we have a table named 'foo' with 1.5 billion+ records which we are joining with another table called 'bar' with 100 records, like: select foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a;
Hive streams the right-most table (bar) and buffers the other tables (foo) in memory before performing the map-side/reduce-side join. Hence, if you buffer 1.5 billion+ records, your join query will fail, as buffering that many records will almost certainly result in a Java heap space exception.
So, to overcome this limitation and free the user from remembering the order of joined tables based on their record size, Hive provides the keyword /*+ STREAMTABLE(foo) */ which tells the Hive analyzer to stream table foo.
select /*+ STREAMTABLE(foo) */ foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a;
Hence, in this way the user is free from having to remember the order of the joined tables.
Can a partition be archived? What are the advantages and Disadvantages?
Yes. A partition can be archived. The advantage is that it decreases the number of files stored in the namenode, and the archived file can still be queried using Hive. The disadvantage is that querying becomes less efficient, and archiving does not offer any space savings.
What is a generic UDF in hive?
It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input arguments programmatically and provide an appropriate response.
The following statement failed to execute. What can be the cause? LOAD DATA LOCAL INPATH ‘${env:HOME}/country/state/’ OVERWRITE INTO TABLE address;
The local inpath should contain a file and not a directory. The ${env:HOME} is a valid variable available in the Hive environment.
How do you specify the table creator name when creating a table in Hive?
The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like − TBLPROPERTIES(‘creator’=‘Joan’)
What is SQL INTERFACE FOR HADOOP? HCATALOG
Since 2008, Hive has reigned as the de facto SQL interface for Hadoop, as it provides a relational view through a SQL-like language to data within Hadoop. HCatalog publishes this same interface but abstracts it for data beyond Hive. It also publishes a REST interface for external use so that your existing tools can interact with Hadoop in the way you expect… via ODBC and JDBC into SQL!
You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".
In other words, Impala doesn't even use Hadoop at all. It simply has daemons running on all your nodes which cache some of the data that is in HDFS, so that these daemons can return data quickly without having to go through a whole Map/Reduce job.
The reason for this is that there is a certain overhead involved in running a Map/Reduce job, so by short-circuiting Map/Reduce altogether you can get some pretty big gain in runtime.
That being said, Impala does not replace Hive, it is good for very different use cases. Impala doesn't provide fault-tolerance compared to Hive, so if there is a problem during your query then it's gone. Definitely for ETL type of jobs where failure of one job would be costly I would recommend Hive, but Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look and analyze some data without building robust jobs.
What is the Hive configuration precedence order?
There is a precedence hierarchy to setting properties. In the following list, lower numbers take precedence over higher numbers:
1. The Hive SET command
2. The command line -hiveconf option
3. hive-site.xml
4. hive-default.xml
5. hadoop-site.xml (or, equivalently, core-site.xml, hdfs-site.xml, and mapred-site.xml)
6. hadoop-default.xml (or, equivalently, core-default.xml, hdfs-default.xml, and mapred-default.xml)
How do change settings within Hive Session?
We can change settings from within a session, too, using the SET command. This is useful for changing Hive or MapReduce job settings for a particular query. For example, the following command ensures buckets are populated according to the table definition.
hive> SET hive.enforce.bucketing=true;
To see the current value of any property, use SET with just the property name:
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
By itself, SET will list all the properties and their values set by Hive. This list will not include Hadoop defaults, unless they have been explicitly overridden in one of the ways covered in the above answer. Use SET -v to list all the properties in the system, including Hadoop defaults.
How to print header on Hive query results?
We need to use following set command before our query to show column headers in STDOUT.
hive> set hive.cli.print.header=true;
How to get detailed description of a table in Hive?
Use below hive command to get a detailed description of a hive table.
hive> describe extended <tablename>;
How to access sub directories recursively in Hive queries?
To process directories recursively in Hive, we need to set below two commands in hive session. These two parameters work in conjunction.
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Now hive tables can be pointed to the higher level directory. This is suitable for a scenario where the directory structure is as following: /data/country/state/city
How to skip header rows from a table in Hive?
Suppose while processing some log files, we may find header records.
System=….
Version=…
Sub-version=….
As above, it may have 3 lines of headers that we do not want to include in our Hive query. To skip header lines from our tables in Hive, we can set a table property that will allow us to skip them.
CREATE EXTERNAL TABLE userdata (
name STRING,
job STRING,
dob STRING,
id INT,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES("skip.header.line.count"="3");
Is it possible to create multiple table in hive for same data?
Hive creates a schema and applies it on top of an existing data file, so one can have multiple schemas for one data file. Each schema is saved in Hive’s metastore, and the data is not parsed or serialized to disk in a given schema; the schema is applied only when the data is read. For example, if the data file has 5 columns (name, job, dob, id, salary), we can have multiple schemas by choosing any number of columns from that list (a table with 3 columns, or 5 columns, or even 6).
But while querying, any column we specify beyond those actually present in the file will come back as NULL values, as the sketch below illustrates.
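A minimal sketch of two schemas over the same files, using external tables so that neither definition owns or moves the data (the table names, columns, and location are hypothetical):
CREATE EXTERNAL TABLE people_all (name STRING, job STRING, dob STRING, id INT, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/data/people';

-- A narrower schema over the very same files; trailing fields are simply ignored
CREATE EXTERNAL TABLE people_short (name STRING, job STRING, dob STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/data/people';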
What is the maximum size of string data type supported by Hive?
Maximum size is 2 GB.
What are the Binary Storage formats supported in Hive?
By default Hive supports text file format, however hive also supports below binary formats.
Sequence Files, Avro Data files, RCFiles, ORC files, Parquet files
Sequence files: a general binary format; splittable, compressible and row oriented. A typical example: if we have lots of small files, we may use a sequence file as a container, where the file name can be the key and the content is stored as the value. It supports compression, which enables huge gains in performance.
Avro datafiles: like sequence files, splittable, compressible and row oriented, but with additional support for schema evolution and multi-language bindings.
RCFiles: Record Columnar files, a column-oriented storage format. It breaks the table into row splits; within each split the values are stored column by column, all values of the first column first, then the second, and so on.
ORC Files: Optimized Record Columnar files
Is HQL case sensitive?
HQL is not case sensitive.
Describe CONCAT function in Hive with Example?
CONCAT function will concatenate the input strings. We can specify any number of strings separated by comma.
Example: CONCAT('Hive','-','is','-','a','-','data warehouse','-','in Hadoop');
Output: Hive-is-a-data warehouse-in Hadoop
So every time we delimit the strings by '-'. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.
Syntax: CONCAT_WS('-','Hive','is','a','data warehouse','in Hadoop');
Output: Hive-is-a-data warehouse-in Hadoop
Describe REPEAT function in Hive with example?
REPEAT function will repeat the input string n times specified in the command.
Example: REPEAT('Hive',3);
Output: HiveHiveHive.
Describe REVERSE function in Hive with example?
REVERSE function will reverse the characters in a string.
Example: REVERSE('Hive');
Output: eviH
Describe TRIM function in Hive with example?
TRIM function will remove the spaces associated with a string.
Example: TRIM(' Hadoop ');
Output: Hadoop.
If we want to remove only leading or trailing spaces then we can specify the below commands respectively.
LTRIM(' Hadoop');
RTRIM('Hadoop ');
Describe RLIKE in Hive with an example?
RLIKE (regexp LIKE) is a special function in Hive: if any substring of A matches B, it evaluates to true. It follows Java regular expression patterns, and users don't need to put a % symbol for a simple match with RLIKE.
Examples: 'Express' RLIKE 'Exp' –> True
'Express' RLIKE '^E.*' –> True (regular expression)
Moreover, RLIKE comes in handy when the string has some spaces. Without using the TRIM function, RLIKE satisfies the required scenario. Suppose A has the value 'Express ' (2 additional spaces) and B has the value 'Express'. In such situations RLIKE works without needing TRIM.
'Express ' RLIKE 'Express' –> True
Note: RLIKE evaluates to NULL if A or B is NULL.
How do you import and load .csv file into and out of HIVE?
Import
DROP TABLE IF EXISTS mytable;
CREATE TABLE mytable (id_number STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/mytable.csv' OVERWRITE INTO TABLE mytable;
Export
set hive.cli.print.header=true before the SELECT
hive -e 'set hive.cli.print.header=true; select * from your_Table' | sed 's/[\t]/,/g' > /home/yourfile.csv
tab separated
hive -e 'select * from some_table' > /home/yourfile.csv
comma separated
hive -e 'select * from some_table' | sed 's/[\t]/,/g' > /home/yourfile.csv
How do I monitor all the open and aborted transactions in the system along with the transaction id and the transaction state?
Yes. Hive 0.13.0 and later support the SHOW TRANSACTIONS command, which lists all open and aborted transactions along with the transaction ID and the transaction state.
Explain what happens when you run a query in HIVE?
- The Query Planner analyzes the query and converts it to DAG (Directed Acyclic Graph) of Hadoop Map Reduce jobs
- The jobs get submitted to Hadoop cluster in the order implied by the DAG
- For a simple query, only mappers run. The input format is responsible for managing an input split and reading the data off HDFS. Next, the data flows into a layer called the SerDe (Serializer/Deserializer). In this case the data, as a byte stream, gets converted into a structured format by the deserializer part of the SerDe.
- For aggregate queries, the MapReduce jobs will also include reducers. In this case, the serializer of the SerDe converts structured data into a byte stream, which gets handed over to the output format, which writes it to HDFS.
How are JSON files loaded and queried in HIVE?
For exploring Hive’s capability in handling JSON data, I am using blog comments as an example. A blog comment has the following fields.
blogID
date
commenter name
comment
commenter email
commenter web site
Here are some typical JSON records for blog comment. The data has two level of nesting.
{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment",}
You may have noticed that not all comment records have the commenter’s contact info. As we will see later, when we query, Hive returns NULL for any missing field in any record.
Create Table: My table is very simple with only one column, which holds the JSON data. You can think of the data being completely denormalized. I am using external table which means I am keeping my table data off Hive’s control. I am simply telling Hive where to look for the data. Here is the DDL for the table
CREATE EXTERNAL TABLE comments
(
value STRING
)
LOCATION '/Users/pghosh/test/comments';
Next, I loaded my data from a file into the Comments table as below
LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;
The keyword OVERWRITE causes existing data for the table to be wiped out, before the load. Now that the data is loaded, we are ready for some queries.
Time for Query: My first query is going to find emails for all commenters on a given blog. You might want such a query in the following scenario: we just had a new blog published which happens to be very similar to a past blog, and we want to find the email addresses of all commenters on the old blog so that we can send them an email with a link to the new one. It’s very likely that they will be interested in it. Here is the query.
SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
AS email, website WHERE b.blogID='64FY4D0B28';
In this query, the JSON support in Hive has kicked in. The LATERAL VIEW and json_tuple give us a way of navigating the JSON tree. The first instance gives us a virtual table with two columns, blogID and contact. We repeat the same process again to extract data from the next level of the JSON tree. This time it gives us another virtual table with the columns email and website.
The function json_tuple explodes a JSON node and return the child node values. The first argument is the node to explode. The rest of the arguments are the child node names.
Here is the query result. As expected, this query triggers one Map reduce job that does not include any reducer. The Hadoop job contains only the map task.
64FY4D0B28 NULL
64FY4D0B28 NULL
64FY4D0B28 lyau@hotmail.com
64FY4D0B28 NULL
64FY4D0B28 gyocok@gmail.com
64FY4D0B28 shrmf@gmail.com
Time taken: 21.459 seconds
Our next query is an aggregate query. We are interested in number of comments for each blog. This is a group by query as below. Here we navigate only one level in the JSON tree.
SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b
AS blogID, name group by b.blogID;
Here is the result of the query. I am only showing a partial result. The interesting thing to note is that this query maps to one Hadoop job, and the job includes reducers.
Another Way
Another way to handle JSON is to use a JSON SerDe. The external view of any Hive encapsulated data is always column and row oriented. However the internal on disk representation of data could be anything.
A Hive SerDe is the bridge between the internal representation and the external column and record oriented view. In this case it will do all the necessary mapping of JSON fields and columns. You have to add the following to your table DDL. With this approach, your table will have multiple columns.
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
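A sketch of what that DDL might look like end to end; the column list mirrors the comment fields above, and whether nested fields map cleanly to a STRUCT depends on the particular JSON SerDe you use:
CREATE EXTERNAL TABLE comments_json (
  blogID  STRING,
  `date`  STRING,
  name    STRING,
  comment STRING,
  contact STRUCT<email:STRING, website:STRING>   -- nested object, if the SerDe supports it
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/Users/pghosh/test/comments';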
Summing Up
Hopefully, this post has given you some idea about what Hive can do with JSON data. An interesting question that comes up is whether we still need MongoDB. MongoDB is a popular NoSQL document-structured database with very powerful indexing and querying capabilities.
If you are doing mostly aggregate queries touching most of the data, as in a data warehouse application, Hive is a better option than MongoDB. However, if your use cases include queries touching only a small fraction of the data, you may be better off staying with MongoDB or an equivalent document-structured database.
Update in Hive is an expensive operation, because it involves a full table overwrite. So for data that is frequently updated, MongoDB may be a better choice.
Explain Data Serialization with Avro in Hive.
• Created Avro table.
• Discussed the two techniques of loading data into the Avro table.
• Performed queries on the data inside the Avro table
• Downloaded the file with ‘.avro’ extension.
• Viewed the Avro schema.
• Converted the Avro schema into JSON.
• Viewed the contents of table in JSON format.
What is Avro?
Avro is one of the preferred data serialization systems because of its language neutrality.
Due to the lack of language portability in Hadoop Writable classes, Avro becomes a natural choice because of its ability to handle multiple data formats, which can be further processed by multiple languages.
Avro is also very much preferred for serializing the data in Hadoop.
It uses JSON for defining data types and protocols and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
By this, we can define Avro as a file format introduced with Hadoop to store data in a predefined format.This file format can be used in any of the Hadoop’s tools like Pig and Hive.
Implementing Avro file format in Hive
Before we take a look at how the Avro file format is implemented, let’s have a quick introduction to the Avro schema and how to create Avro records, Hive tables and much more.
Avro Schema
Avro relies on a schema. When Avro data is read, the schema used for writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates using it with dynamic scripting languages, since data together with its schema, is fully self-describing.
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other’s full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.
Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries. Using Avro, we can convert unstructured and semi-structured data into properly structured data using its schemas.
Creating a table to store the data in avro format
This process is initiated with the creation of JSON based schema to serialize data in a format that has a schema built in.
Avro has its own parser to return the provided schema as an object.
The created object allows us to create records with that schema.
We can create our schema inside the table properties while creating a Hive table
TBLPROPERTIES ('avro.schema.literal'='{json schema here}');
Now, let's create an Avro table for the Olympic data.
Creating a Hive Table:
create table olympic_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":"string"},
{"name":"age", "type":"int"},
{"name":"country", "type":"string"},
{"name":"year", "type":"string"},
{"name":"closing", "type":"string"},
{"name":"sport", "type":"string"},
{"name":"gold", "type":"int"},
{"name":"silver", "type":"int"},
{"name":"bronze", "type":"int"},
{"name":"total", "type":"int"}
]}');
Inside the tblproperties you can see the schema of the data. Every entry in the fields list will become a column: 'name' defines the column name and 'type' defines the datatype of that column. The above fields are for the Olympic data.
We can view the fields inside the table by using the command describe olympic_avro.
We have successfully created an avro table!
Here’s the in depth explanation behind the commands used for creating the table:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
We used the above line while creating the table. It tells Hive how the data is serialized and deserialized. If we have our own SerDe for the data, there is no need to use this line; otherwise it should be present, as it acts as the default serializer for the data.
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
This line says that we are storing the file as AvroContainerInputFormat which is present in the default hive package as specified above. When using this line, the file will be stored with extension .avro, which is an extension of Avro.
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
Similarly, this line specifies that data written to the table is stored in the Avro container format, again using the .avro extension.
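As an aside (not part of the original walk-through): from Hive 0.14 onwards the same SerDe, input format, and output format can usually be requested with a single shorthand, so an equivalent definition would look roughly like this (olympic_avro_short is a hypothetical name):
create table olympic_avro_short
stored as avro
tblproperties ('avro.schema.literal'='{ ...same schema as above... }');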
The next step is to insert the data into the table.
Behind the scenes, the Avro parser turns the schema into a schema object, and only data that matches that schema can be written through it. Now the data can be inserted into the Avro table created with this schema.
Data Insertion into Avro Table: 
There are two ways to insert data into an Avro table:
1. If we have a file with the ‘.avro’ extension and its schema matches the one specified for the table, we can load it directly using the command
LOAD DATA LOCAL INPATH 'PATH OF THE FILE' INTO TABLE olympic_avro;
2. You can copy the contents of a previously created table into the newly created Avro table.
Let’s take a look at the second type of data insertion technique to import data into an Avro table. We will begin by creating a table which is delimited by tab space.
Text File Table Creation:
create table olympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as textfile;
Here, we are creating a table with the name ‘olympic’. The schema of the table is as specified and the data inside the input file is delimited by tab space. At the end of the command we have specified ‘stored as textfile’, which means that we are using a TEXTFILE format. The schema of the created table can be checked using the command describe olympic;
Now, the data can be loaded into the created table as follows:
load data local inpath 'path of your file' into table olympic;
We have successfully loaded the data from the input file into the table in TEXTFILE format.
Now, to copy the data from the ‘olympic’ table into the newly created Avro table, the following command can be used:
insert overwrite table olympic_avro select * from olympic limit 20;
This olympic data has some NULL values and the Avro schema above cannot handle NULL values by default. To make the Avro table work with NULL values, the schema of the table needs to be changed; we will see how this is done below.
Now, let’s perform an average operation on the Avro table using the command
select AVG(age) from olympic_avro;
You can see that we have successfully performed the average operation on our newly created Avro table.
Output: The average age of the athletes is 23.65
We have successfully created an Avro table, imported data into it, and performed an average operation on the available data.
Handling NULL Values in Avro Table:
As mentioned earlier, the schema of the table needs to be changed so that the table accepts NULL values as well. In this case, the schema will be as shown below:
create table olympic_avro1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"namespace":"com.example.avro",
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":["string","null"],"default":null},
{"name":"age", "type":["int","null"],"default":0},
{"name":"country", "type":["string","null"],"default":null},
{"name":"year", "type":["string","null"],"default":null},
{"name":"closing", "type":["string","null"],"default":null},
{"name":"sport", "type":["string","null"],"default":null},
{"name":"gold", "type":["int","null"],"default":0},
{"name":"silver", "type":["int","null"],"default":0},
{"name":"bronze", "type":["int","null"],"default":0},
{"name":"total", "type":["int","null"],"default":0}
]}');
The difference between the normal Avro schema and the above specified schema is as follows:
In the type of each field above we have given two values: the actual data type and null. This means that if the value matches the specified data type it is accepted; otherwise the value is treated as NULL, and the field falls back to the value given in the default attribute. Here we have given the default value as null for string fields and 0 (zero) for int fields.
NOTE: The defaults null and 0 are JSON literals, so they appear without quotes in the schema above; only defaults that are actual string values would need double quotes.
Now, let’s create a table with the above specified schema which can accept NULL values as well.
We can see the columns created in the table using the below command:
describe olympic_avro1
This table will now accept NULL values as well.
Let’s try loading some data containing NULL values into the table. We know that the olympic data has some NULL values in it, so let us load the contents of the olympic table into the newly created table using the command
insert overwrite table olympic_avro1 select * from olympic;
We have successfully loaded the contents of the olympic table containing NULL values into olympic_avro1 table.
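As an optional check (my own addition; the article does not say which columns contain NULLs, so the column used here is a guess), you can count how many rows came through with a NULL in a given column:
select count(*) from olympic_avro1 where age is null;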
Now, let’s try performing the AVERAGE operation on the newly created table using the below command:
select AVG(age) from olympic_avro1
We have successfully performed the AVERAGE operation on the entire olympic data.
Output: The average age of athletes is found to be 26.405433646812956
Converting Avro to JSON
To convert the Avro file into JSON we need a jar file called ‘avro-tools-1.7.5.jar’, which includes a tojson option for converting an Avro file into JSON.
The jar is published by the Apache Avro project and can be downloaded from its releases.
Once downloaded (it will typically land in the ‘Downloads’ folder), move to that folder in the terminal and run a command with the following syntax.
java -jar avro-tools-1.7.5.jar tojson 'avro file name' >newfilename.json
In our case the command will be as follows:
java -jar avro-tools-1.7.5.jar tojson 000000_0.avro >olympic.json
Now, a file named ‘olympic.json’ will be created containing the table’s records in JSON form. To see the contents of this json file use the below command:
cat olympic.json
Now, you can see the contents of your table in JSON format.
As mentioned above, the Avro parser converts the schema into an object, and records written through that object hold the data; the JSON output shows each record with its field names and values.
Loading Files with .avro Extension into an Avro Table:
Let’s see how to load a file with the ‘.avro’ extension directly into an Avro table, as mentioned in the section “Data Insertion into Avro Table”.
Let us create another Avro table with the name ‘olympic_avro2’ and the schema as follows:
create table olympic_avro2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":"string"},
{"name":"age", "type":"int"},
{"name":"country", "type":"string"},
{"name":"year", "type":"string"},
{"name":"closing", "type":"string"},
{"name":"sport", "type":"string"},
{"name":"gold", "type":"int"},
{"name":"silver", "type":"int"},
{"name":"bronze", "type":"int"},
{"name":"total", "type":"int"}
]}');
We can see the columns we have created using the command describe olympic_avro2
Now, let’s load the .avro extension file, 000000_0.avro using the command
LOAD DATA LOCAL INPATH '/path of the file' into table olympic_avro2
We have successfully loaded the .avro file into the Avro table. Now, let’s perform the AVERAGE operation on the newly created ‘olympic_avro2’ table, using the below command:
select AVG(age) from olympic_avro2;
Output: We have got the AVERAGE age of athletes as 23.65, which is the same as what we got in the first case.
What are the different types of metastore in Hive?
Hive Metastore:
- In Hive, the metastore is the central repository that stores metadata for Hive tables and partitions. Any database with a JDBC driver can be used as the metastore backend.
- By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk.
- There are 3 different ways to setup the metastore server using different Hive configurations:
- Embedded Metastore
- Local Metastore
- Remote Metastore
- Embedded metastore: Derby is the default database for the embedded metastore. It is a simple way to get started with Hive. But we can only have one Hive session open at a time that shares the same metastore. Trying to start a second session gives the error:
Failed to start database ‘metastore_db’
So, the embedded metastore is not a practical solution for multi-user use, but it can be used for unit tests.
- Local Metastore: In the local metastore configuration, a standalone database (MySQL, PostgreSQL, or any other database with a JDBC driver) is used as the metastore backend.
In this mode, the metastore service still runs in the same JVM as Hive but connects to the external database, so multiple users can open Hive sessions at the same time.
- Remote Metastore: In the remote metastore configuration, all Hive clients connect to a metastore server over Thrift, and the metastore server in turn queries the backing database (MySQL, PostgreSQL, Oracle, etc.). Here the metastore server runs in a separate process from the Hive service.
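A minimal configuration sketch for the remote setup, using standard Hive property names; the host names, port, and database name below are placeholders, not values from this article:
# hive-site.xml on Hive clients: point at the remote metastore service
hive.metastore.uris = thrift://metastore-host:9083
# hive-site.xml on the metastore server: JDBC connection to the backing database
javax.jdo.option.ConnectionURL = jdbc:mysql://db-host/metastore_db
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver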
(Figure: high-level architecture of the three Hive metastore configurations.)
Which classes are used by Hive to read and write HDFS files?
The following classes are used by Hive to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: these two classes read/write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat: these two classes read/write data in Hadoop SequenceFile format.
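To make the connection concrete, here is a short sketch (the table and column names are made up for illustration): the storage clause chosen when a table is created is what selects these classes under the hood.
-- Plain text storage: read and written via the text format classes above
create table events_text (id int, payload string)
row format delimited fields terminated by '\t'
stored as textfile;
-- SequenceFile storage: read and written via the SequenceFile format classes above
create table events_seq (id int, payload string)
stored as sequencefile;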
What is the best way to implement row-based security in Hive?
Row-level security can be achieved by defining views that hard-code the filtering logic and then granting access to the views (for example via Ranger) rather than to the base tables.
An alternative available since Hive 1.2.0 is to filter dynamically based on the current user, with the current_user() function. This provides row-by-row security. One option to define the ACLs is via a permission table:
create table permission( username string, driverid string);
For example, to secure the driver(driverid, drivername) table, you could insert the following permission row:
insert into permission values ('jsmith', '25');
Finally define the view by joining against it:
create view secure_driver as select d.* from driver d inner join permission p on d.driverid = p.driverid where p.username = current_user();
In short: create Hive views that filter out rows, then use Ranger to grant the user access to the view (and not the underlying table).
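A minimal sketch of the grant, assuming SQL-standard based authorization is enabled in Hive (the user and view names are the ones from the example above):
-- Allow jsmith to query only the filtered view, not the base driver table
grant select on table secure_driver to user jsmith;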
Can you mention some lessons learned based on your experience?
- Look at the Hadoop job history to find the bottlenecks
- Understand the EXPLAIN output of your queries
- Test queries on a small data set before running them on a very large one
- Partition tables where possible, based on common filter conditions
- Tune joins - try using map joins
- Try using a splittable compression setup (e.g., Snappy inside a container format such as SequenceFile); prefer block compression over record compression - it is more efficient
- Ensure intermediate and final output is always compressed
- Put the smaller table on the left side of the join and use map-only joins where possible
- Tune Hadoop - increase io.sort.mb (100 by default) to 250 or 500 (see the example settings below)
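A hedged sketch of session-level settings behind several of the points above; the property names are standard Hive/Hadoop ones, but appropriate values depend on your version and cluster:
-- Enable map-side (map-only) joins for small tables
set hive.auto.convert.join=true;
-- Compress intermediate map output and final query output
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Inspect the query plan before running on the full data set
explain select country, avg(age) from olympic group by country;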