
Google BigQuery Interview Question

These questions will help you prepare for the Google Data Engineer and Google BigQuery Developer interviews.

Updated
10 min read

What are authorized views and materialized views

Both authorized views and materialized views are powerful tools in BigQuery that offer improved data access and performance. However, they have key differences in their functionalities and applications:

Authorized Views:

  • Definition: Views that are explicitly granted ("authorized") access to the dataset containing their source tables, so users can query the view without having read access to the underlying data.

  • Functionality: Expose a filtered or aggregated subset of data, for example a SELECT with a WHERE clause that hides sensitive rows or columns.

  • Read-only: Like all logical views in BigQuery, they cannot be modified with DML.

  • Benefits:

    • Fine-grained access control: Users see only what the view exposes, never the base tables.

    • Simplified data access: Complex or sensitive queries are wrapped behind a simple view.

    • Always current: As logical views, they reflect the latest data in the source tables on every query.

To create an authorized view, see:

Steps to set up Authorized View in BigQuery with Principle of least Privilege | PDE & PCA Concepts

Materialized Views:

  • Definition: User-created views that store pre-computed results of a query.

  • Functionality: Materialized views physically store data based on the query, significantly speeding up repeated queries with the same logic.

  • Refresh: BigQuery refreshes materialized views automatically in the background as the underlying data changes; you can also trigger a manual refresh.

  • Benefits:

    • Fast query performance: Provides the fastest response for frequently repeated queries.

    • Reduced processing overhead: Eliminates repeated calculations for frequently used queries.

    • Scalability: Enables efficient handling of large datasets for repeated queries.

  • Example (SQL):

      CREATE MATERIALIZED VIEW [view_name] AS
      SELECT [column1], [column2], SUM([column3]) AS total_value
      FROM [dataset_name.][table_name]
      GROUP BY [column1], [column2];
    

    This query creates a materialized view named [view_name] that stores pre-calculated total values for specific columns and groups.

Choosing the Right Option:

The best choice between authorized views and materialized views depends on your specific needs:

  • Frequently used queries: Materialized views are ideal for queries used repeatedly, significantly improving performance.

  • Large datasets: Materialized views offer faster processing for large datasets involved in repeated queries.

  • Data complexity: Materialized views are suitable for complex queries with aggregations and calculations.

  • Data update frequency: Authorized views are better if the underlying data changes frequently, ensuring the view reflects the latest information.

  • Data access control: Authorized views inherit permissions from the base data, simplifying control.

How to optimize table performance in BigQuery

Optimizing table performance in Google BigQuery involves various strategies, including optimizing table structure, using efficient queries, and leveraging features provided by BigQuery. Here are some tips to enhance table performance:

1. Partition and Cluster Tables:

  • Partitioning: Divide large tables into smaller, more manageable partitions based on a date or timestamp column. This can significantly reduce the amount of data scanned during queries.

  • Clustering: Use clustering to organize data within partitions based on one or more columns. Clustering improves query performance by minimizing the amount of data that needs to be read.

Anjan GCP Data Engineering on YouTube has explained both clustered and partitioned tables:

Big Query Table Partitions with Examples

Big Query Clustered Tables with Examples
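The effect of partition pruning can be sketched in plain Python (hypothetical in-memory data and column names; in BigQuery the pruning happens server-side before any data is read):

```python
from collections import defaultdict

# Hypothetical event rows, grouped by the partition column (event_date).
partitions = defaultdict(list)
for day, user in [("2024-01-01", "a"), ("2024-01-01", "b"),
                  ("2024-01-02", "c"), ("2024-01-03", "d")]:
    partitions[day].append({"event_date": day, "user": user})

def scan_all():
    """No partition filter: every row in the table is scanned."""
    return [r for day_rows in partitions.values() for r in day_rows]

def scan_partition(day):
    """Filter on the partition column: only one partition is scanned."""
    return list(partitions.get(day, []))

full = scan_all()                       # all 4 rows scanned
pruned = scan_partition("2024-01-01")   # only 2 rows scanned
```

A query filtered on the partition column reads a fraction of the table, which is exactly what reduces the bytes billed in BigQuery.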

2. Use Appropriate Data Types:

Choose the most suitable data types for your columns to minimize storage and improve query performance. Avoid unnecessary use of STRING for numeric or boolean data.

3. Optimize Schema Design:

Normalize or denormalize your schema based on query patterns. Denormalization can reduce the need for JOIN operations and improve query performance.

4. Avoid SELECT *:

Instead of selecting all columns using SELECT *, explicitly list only the columns you need. This reduces the amount of data read and improves query speed.

5. Control Query Output Size:

Use LIMIT and WHERE clauses to control the amount of data retrieved. Minimize the data scanned by only retrieving the necessary rows.

6. Use Batch Loading for Large Data:

When loading large amounts of data, consider using batch loading. Load data in larger chunks rather than row by row for better performance.
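The "larger chunks" idea can be sketched as a simple batching helper (hypothetical row data; a real pipeline would hand each batch to a single load job instead of inserting row by row):

```python
def batches(rows, batch_size):
    """Yield successive batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [{"id": i} for i in range(10)]

# 10 rows in batches of 4 -> 3 load jobs instead of 10 row-by-row inserts.
loads = list(batches(rows, 4))
```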

7. Optimize JOIN Operations:

If JOIN operations are necessary, use appropriate JOIN types and conditions. BigQuery has no traditional indexes, so where possible cluster the tables on frequently joined columns and filter the data before joining.

8. Optimize Query Syntax:

Write efficient queries. Review query execution plans using the EXPLAIN statement to identify opportunities for optimization.

9. Use Materialized Views:

Consider materialized views to precompute and store aggregated results of frequently run queries. BigQuery maintains them automatically as the base tables change, and the optimizer can transparently rewrite eligible queries to read from them.

10. Monitor and Analyze Performance:

Use BigQuery's monitoring tools to analyze query performance. Review the Query Execution Details page in the Cloud Console for insights into query performance.

What is a slot in BigQuery

A slot is a unit of computational capacity (CPU, memory, and network) that BigQuery uses to execute SQL queries. BigQuery automatically determines how many slots a query needs based on its size and complexity: with on-demand pricing, queries draw from a shared pool of slots, while capacity-based pricing lets you reserve a fixed number of slots for your workloads.
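Average slot usage for a job can be estimated from its total slot-milliseconds divided by elapsed time, the usual back-of-the-envelope formula applied to the job statistics in INFORMATION_SCHEMA (the numbers below are made up):

```python
def avg_slots(total_slot_ms, elapsed_ms):
    """Approximate average slots = total slot-milliseconds / elapsed milliseconds."""
    return total_slot_ms / elapsed_ms

# Hypothetical job: 2,000,000 slot-ms of work spread over a 10-second query
# means roughly 200 slots were in use on average.
print(avg_slots(2_000_000, 10_000))
```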

How to query data from Amazon S3 in BigQuery using BigQuery Omni

BigQuery Omni allows querying data directly from external sources like Amazon S3 without needing to load it into BigQuery first. This offers several advantages, including:

  • Real-time access: Analyze data as soon as it's available in S3 without waiting for loading.

  • Cost-efficiency: Avoid unnecessary storage costs by querying data directly from the source.

  • Scalability: Process massive datasets without worrying about BigQuery storage limitations.

Here's how to query data from Amazon S3 using BigQuery Omni:

1. Set up BigQuery Omni:

  • Enable BigQuery Omni in your Google Cloud project.

  • Create a connection to your Amazon S3 account in BigQuery.

  • Specify IAM permissions for BigQuery to access your S3 data.

2. Create a BigLake table:

  • Define a BigLake table in BigQuery that references the data location in your S3 bucket.

  • Specify the data schema and format (e.g., CSV, JSON, Avro).

  • Optionally, configure partitioning and clustering for optimal query performance.

3. Write your SQL query:

  • Use standard BigQuery SQL syntax to query the data in the BigLake table.

  • BigQuery automatically fetches the data from S3 on-demand as needed for the query.

  • You can perform joins, aggregations, and other operations on the S3 data as if it were stored in BigQuery.

4. Export results:

  • Export the results to other destinations like Google Cloud Storage or BigQuery tables.

Here are some points to consider when querying data from Amazon S3 with BigQuery Omni:

  • Data access control: Ensure you have proper IAM permissions granted to BigQuery for accessing your S3 data.

  • Data format: BigQuery supports various data formats, but ensure the format used in S3 is compatible with your BigLake table definition.

  • Data size: Large datasets may require additional configuration for efficient querying and performance.

  • Query complexity: Complex queries may require more resources and potentially incur network costs for data retrieval from S3.

What are the different ways to export data from BigQuery

Common options are the EXPORT DATA SQL statement, the bq extract command, and the extract jobs exposed by the client libraries and API; all of them typically write results to Google Cloud Storage, for example:

bq extract --destination_format [FORMAT] [DATASET.TABLE] [GCS_URI]

bq command to load data from GCS into BigQuery

bq load --source_format [FORMAT] [DATASET.TABLE] [GCS_URI]

Which Airflow operators are used to load data from GCS to BigQuery

Airflow offers several operators for loading data from Google Cloud Storage (GCS) to BigQuery. Here are the most commonly used ones:

GCSToBigQueryOperator:

This operator is the most common choice for loading data from GCS to BigQuery.

  • Features:

    • Supports various data formats like CSV, JSON, Avro, and Parquet.

    • Offers options to specify schema, compression, and write disposition.

    • Allows triggering load jobs based on file arrival or time intervals.

  • Example:

Python

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_to_bigquery",
    start_date=datetime(2023, 12, 13),
    schedule_interval="@daily",
) as dag:

    load_data = GCSToBigQueryOperator(
        task_id="load_data",
        bucket="my_bucket",
        source_objects=["data.csv"],
        destination_project_dataset_table="my_dataset.my_table",
        source_format="CSV",
    )

Difference in DELETE table and Truncate table in BigQuery

TRUNCATE and DELETE are both SQL commands used to remove data from a table, but they differ in their functionality and the way they accomplish the task.

  1. DELETE:

    • The DELETE statement removes rows from a table based on a specified condition. In BigQuery, DELETE always requires a WHERE clause; to delete every row, use WHERE true.

    • It is a DML operation that scans the table, and in BigQuery the deleted data remains recoverable within the time travel window.

    • The DELETE statement is more flexible because the WHERE clause lets you target exactly the rows to remove.

Example:

    DELETE FROM table_name WHERE condition;
  2. TRUNCATE:

    • The TRUNCATE TABLE statement removes all rows from a table quickly and efficiently.

    • Unlike DELETE, it is effectively a metadata operation: the table's schema, partitioning, and clustering configuration are left intact.

    • TRUNCATE is faster and cheaper than an unconditional DELETE because it does not scan or process individual rows.

Example:

    TRUNCATE TABLE table_name;

Key Differences:

  • DELETE is a row-level operation that can remove specific rows based on a condition, whereas TRUNCATE is a table-level operation that removes all rows.

  • DELETE scans the table and runs as a DML job, making it the right choice for selective deletions.

  • TRUNCATE is the efficient way to empty a table, but it cannot target individual rows and, in BigQuery, works only on regular tables (not views or external tables).

In summary, use DELETE when you need to selectively remove specific rows, and TRUNCATE TABLE when you want to quickly remove all rows from a table.
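The row-level vs table-level distinction can be mimicked on an in-memory table (a plain-Python sketch, not BigQuery's logging or billing semantics):

```python
table = [{"id": 1, "active": True}, {"id": 2, "active": False},
         {"id": 3, "active": True}]

def delete_where(rows, predicate):
    """DELETE ... WHERE: keep only the rows that do NOT match the condition."""
    return [r for r in rows if not predicate(r)]

def truncate(rows):
    """TRUNCATE TABLE: drop every row in one operation."""
    return []

after_delete = delete_where(table, lambda r: not r["active"])  # removes id 2
after_truncate = truncate(table)                               # removes all rows
```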

SQL

SQL questions are one of the main parts of a BigQuery interview.

Question:

Write a query to produce the output below: for each department, the employee(s) with the highest salary.

Tables

Employee table:
+----+-------+--------+--------------+
| id | name  | salary | departmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |
+----+-------+--------+--------------+
Department table:
+----+-------+
| id | name  |
+----+-------+
| 1  | IT    |
| 2  | Sales |
+----+-------+

Output

+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Jim      | 90000  |
| Sales      | Henry    | 80000  |
| IT         | Max      | 90000  |
+------------+----------+--------+

Solution:

SELECT d.name AS Department,
       e.name AS Employee,
       e.salary AS Salary
FROM employee e
JOIN department d ON e.departmentId = d.id
WHERE e.salary = (SELECT MAX(salary)
                  FROM employee
                  WHERE departmentId = e.departmentId);
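The expected output (highest salary per department) can be reproduced from the sample data in plain Python:

```python
employees = [
    {"name": "Joe",   "salary": 70000, "departmentId": 1},
    {"name": "Jim",   "salary": 90000, "departmentId": 1},
    {"name": "Henry", "salary": 80000, "departmentId": 2},
    {"name": "Sam",   "salary": 60000, "departmentId": 2},
    {"name": "Max",   "salary": 90000, "departmentId": 1},
]
departments = {1: "IT", 2: "Sales"}

# MAX(salary) per department, mirroring the correlated subquery.
dept_max = {}
for e in employees:
    d = e["departmentId"]
    dept_max[d] = max(dept_max.get(d, 0), e["salary"])

# Keep employees whose salary equals their department's maximum.
result = [(departments[e["departmentId"]], e["name"], e["salary"])
          for e in employees if e["salary"] == dept_max[e["departmentId"]]]
```

This yields Jim and Max for IT (both at 90000) and Henry for Sales, matching the output table.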

Question:

Find the average salary for each department. Display the department name and the average salary.

Solution:

SELECT d.name AS Department,
       AVG(e.salary) AS AverageSalary
FROM employee e
JOIN department d ON e.departmentId = d.id
GROUP BY d.name;

Question:

List the employees who have a salary greater than the average salary across all departments. Display the employee name, salary, and department name.

Solution:

WITH AvgSalaryCTE AS (
  SELECT AVG(salary) AS AvgSalary
  FROM employee
)

SELECT e.name AS Employee,
       e.salary AS Salary,
       d.name AS Department
FROM employee e
JOIN department d ON e.departmentId = d.id
CROSS JOIN AvgSalaryCTE
WHERE e.salary > AvgSalary;

Question:

Find the department with the highest total salary. Display the department name and the total salary.

SELECT d.name AS Department,
       SUM(e.salary) AS TotalSalary
FROM employee e
JOIN department d ON e.departmentId = d.id
GROUP BY d.name
ORDER BY TotalSalary DESC
LIMIT 1;

Question:

List the employees and their salaries in descending order of salary. For employees with the same salary, order them alphabetically by name.

Solution:

SELECT name AS Employee,
       salary AS Salary
FROM employee
ORDER BY salary DESC, name;

Question:

Find the department(s) where the average salary is greater than a specified value (e.g., 80000). Display the department name and the average salary.

Solution:

SELECT d.name AS Department,
       AVG(e.salary) AS AverageSalary
FROM employee e
JOIN department d ON e.departmentId = d.id
GROUP BY d.name
HAVING AVG(e.salary) > 80000;

Question:

What is the output of joining Table A and Table B (on their single column) with an Inner Join, Left Outer Join, and Full Join?

Table A:

1
2
2
3
4
6
NULL
NULL

Table B:

1
1
2
7
8
null

Solution:

A row matches when the join-column values are equal; NULL never equals NULL, so NULL rows appear in the output only where the join type preserves unmatched rows.

Inner Join

Result:

1
1
2
2

Left Outer Join

Result:

1
1
2
2
3
4
6
NULL
NULL

Full Join

Result:

1
1
2
2
3
4
6
7
8
NULL
NULL
NULL
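The join semantics can be checked with a small Python sketch, using None for SQL NULL (each matching pair of rows produces one output row, and NULL never matches NULL):

```python
A = [1, 2, 2, 3, 4, 6, None, None]
B = [1, 1, 2, 7, 8, None]

def inner_join(a, b):
    # Only equal, non-NULL values match; each pair yields one row.
    return [x for x in a for y in b if x is not None and x == y]

def left_join(a, b):
    out = []
    for x in a:
        n = sum(1 for y in b if x is not None and x == y)
        out.extend([x] * max(n, 1))  # unmatched left rows still appear once
    return out

def full_join(a, b):
    out = left_join(a, b)
    # Append right-side rows that found no match on the left.
    out.extend(y for y in b
               if y is None or not any(x == y for x in a if x is not None))
    return out

print(inner_join(A, B))  # [1, 1, 2, 2]
```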


Question:

SELECT CONCAT('a', NULL, 'b')

Result:

NULL

In SQL, concatenating anything with NULL results in NULL. If you want to handle NULL values differently, you can use the COALESCE function or the CONCAT function with multiple arguments.

SELECT CONCAT('a', COALESCE(NULL, ''), 'b');
-- or
SELECT CONCAT('a', IFNULL(NULL, ''), 'b');

Both of these queries will result in 'ab'.
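The NULL-propagation rule can be mimicked in Python, using None to stand in for SQL NULL:

```python
def sql_concat(*args):
    """Like SQL CONCAT: NULL if any argument is NULL, else the joined string."""
    if any(a is None for a in args):
        return None
    return "".join(args)

def coalesce(*args):
    """Like SQL COALESCE: the first non-NULL argument, or None if all are NULL."""
    return next((a for a in args if a is not None), None)

print(sql_concat("a", None, "b"))                # None, like CONCAT('a', NULL, 'b')
print(sql_concat("a", coalesce(None, ""), "b"))  # 'ab'
```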

Vazid | Blog


Multi-Cloud DevOps Engineer | Skilled in AWS, GCP, Azure cloud environments.