Google BigQuery SQL Syntax: A Comprehensive Guide
Jill Thornhill | 6th Jul 2023

What is Google BigQuery? What can it do for your organization? Is BigQuery SQL a valuable skill? Where can you learn it, and where can you get some practice?

Very few organizations today are not computerized. Many processes now make use of the IoT (Internet of Things), where all kinds of devices are networked and continually feed real-time data into computer systems. The result is a vast amount of data available for decision-making, often referred to as a data lake. The major challenges with this stream of data are how and where to store it and how best to make it available to decision-makers. Google's BigQuery is arguably one of the best answers to these challenges, and BigQuery SQL skills are an important addition to your portfolio.

If you want to learn BigQuery, where should you start? BigQuery supports the ANSI SQL standard, so the first step is to gain basic SQL skills. If you don't already know SQL, LearnSQL.com's SQL Basics course will get you up to speed very quickly. You'll learn important database concepts; by the end of the course, you'll be able to extract, aggregate, and analyze data from one or more tables to gain meaningful information. The course takes about 10 hours to complete and includes 129 interactive exercises that teach you to solve real-world problems with SQL.

What Is Google BigQuery?

Google BigQuery is a serverless, highly scalable data warehousing solution. It can analyze terabytes of data in seconds and petabytes of data in minutes. But what does that actually mean?

Data Warehouse

Data is usually stored for one of two purposes:

- To support a business's day-to-day operations.
- To support decision-making.

The first of these is known as operational data, and it is often stored by several different computer systems as they carry out various tasks needed by the organization. The second is known as analytical data. Ideally, this data should be stored in a central place where it's accessible to anyone who needs information. This central place is known as the data warehouse, and it should be optimized to search and analyze large volumes of data quickly. You'd use a process known as ETL (Extract, Transform, Load) to transfer operational data into the warehouse.

As a data warehouse, BigQuery has several advantages:

- It has excellent ETL tools.
- It uses the Colossus file system, which is designed for big data; more storage can easily be added when needed.
- It uses SQL as its query language, so it's compatible with top business intelligence (BI) tools like Google Data Studio, which allows data to be presented in many different ways.

Fast Searches

BigQuery uses the Dremel query engine to process large volumes of data quickly. As I mentioned earlier, BigQuery can scan petabytes (a petabyte is roughly equivalent to a million gigabytes) in minutes. How does it do this? Each time you run a query on BigQuery, thousands of worker threads split the task among them; this allows enormous amounts of data to be scanned concurrently. The results are then amalgamated and returned to the user.

Serverless Architecture

If you use BigQuery, you don't hire a server. This means you don't have to manage the infrastructure, worry about backups and security, or figure out whether you need more hard disks. Your data becomes part of the great data lake – or ocean! This digital data ocean is split across countless servers worldwide. If more space is needed, Google allocates and manages it. You only pay for what you use.

What about the processing power needed to carry out the computations? Google allocates processing power as and when it's needed. Google's high-speed data center network, Jupiter, is designed for fast communication between the machines doing the work; you only pay for the processing slots that you use.

And what about security? You can create views for your data and specify exactly who can see what. Since data is encrypted – both when stored and when in transit – it's safe from intruders.
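For example, a view can expose just the columns other people are allowed to query while the underlying table stays private. Here's a minimal sketch; the my-project, reporting, and staging names are made-up placeholders rather than anything you'll find in the public datasets used later in this article:

-- A hypothetical view that exposes only three columns of a source table.
-- Anyone allowed to query the view never sees the columns it leaves out.
CREATE VIEW `my-project.reporting.population_summary` AS
SELECT country_name, year, midyear_population
FROM `my-project.staging.midyear_population`

Typically you'd then share the dataset that holds the view rather than the one that holds the source table.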
Practicing BigQuery

If you already know some SQL, you can try out BigQuery for yourself right now. If you don't, you can still try out some of the examples in the next section to get a feel for what it can do. But to get the full benefit of BigQuery, it's worth upping your SQL skills by enrolling in a comprehensive SQL course. LearnSQL.com's SQL from A to Z learning track includes 7 courses that take you from beginner to expert.

Where and how can you practice these skills in BigQuery? Anyone can create a project on BigQuery for free; you only get charged when you want to load and process large amounts of data. The BigQuery sandbox lets you insert your own test data, or you can access a wide range of public datasets to play with.

To get started, visit Google Cloud Console's BigQuery page. If you're not already logged into a Google account, you may be asked to do so. The first time you access this console, you'll see a notice asking you to create a project. Click CREATE PROJECT and either give your project a name or accept Google's suggestion.

Now you'll need some data to practice on. Fortunately, BigQuery has lots of public data available. To access it, you'll need to add the public datasets to your SQL Workspace. Towards the top left of the page, click ADD. A pop-up screen will appear, giving you several choices; choose the option to star a project by name. You'll be asked for the project name. Enter bigquery-public-data and click STAR. You should see this project listed in the project explorer on the left of the screen.

And now you're all ready to start exploring. If you click the arrow next to bigquery-public-data, you'll see a list of all the datasets in this project. And if you click a dataset, you'll see a list of its tables. The examples in the next section of this article are taken from the census_bureau_international dataset.
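If you'd rather explore with a query than by clicking through the explorer, BigQuery's INFORMATION_SCHEMA views can list a dataset's tables for you. This is purely an optional shortcut; a query along these lines should show every table in the census dataset:

-- Lists the tables available in the census_bureau_international dataset.
SELECT table_name
FROM `bigquery-public-data.census_bureau_international`.INFORMATION_SCHEMA.TABLES
ORDER BY table_name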
If you've already had experience with SQL, that's all you need to get started. However, if you're an SQL newbie, you may like to get some practice in an environment where help is available and you're guided towards finding solutions to problems. In that case, I'd recommend LearnSQL.com's practice track, where you'll have the chance to solve over 80 real-world challenges.

BigQuery Syntax

Now that you know how to set yourself up with BigQuery, let's have a look at the BigQuery syntax. But first, as you may already know, SQL has many different dialects unique to different database products. You can learn more about the different SQL dialects here. The American National Standards Institute (ANSI) defines the recommended standards for SQL; each dialect may or may not implement all of these standards, and many dialects have their own additions to the language to allow more functionality.

BigQuery originally had a very non-standard version of SQL that was unique to its needs. This is now called Legacy SQL, and you'd only need to learn it if you wanted to work for a company that still uses it. BigQuery now uses Google SQL, which sticks very closely to the ANSI standard. If you're already familiar with another dialect (such as MS SQL Server or MySQL), you'll find very few differences when using BigQuery. The major difference is that BigQuery has some extra statistical features and supports complex data structures like JSON and arrays.
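To give a feel for the difference, here's the same simple query written both ways; this is only an illustration, and you won't need the legacy form for anything else in this article. The most visible change is how tables are referenced, and the console runs Google SQL by default, so you'd have to change its query settings to run the legacy version at all.

-- Legacy SQL: square brackets and a colon after the project name.
SELECT country_name FROM [bigquery-public-data:census_bureau_international.midyear_population] LIMIT 5

-- Google SQL: standard dot-separated names, as used throughout this article.
SELECT country_name FROM `bigquery-public-data.census_bureau_international.midyear_population` LIMIT 5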
Let's have a look at a few examples of running queries in the Cloud Console. If you've never used SQL before and you're finding the examples difficult to follow, you might find this article on the most important SQL commands helpful. You can find the full syntax of all the Google SQL commands here.

Running Commands in the BigQuery Cloud Console

To the right of the project explorer in the BigQuery Cloud Console, you'll see a window where you can run Google SQL commands. You type your query in the query window, then click RUN in the actions bar at the top. If you've typed a valid SQL command, you'll see the data you requested in the results window. If not, you'll see an error message instead.

Simple SQL Commands

Extracting and analyzing data uses only one SQL command: the SELECT statement. In its simplest form, the command is SELECT * FROM tablename, where tablename is the name of one of the tables in your database. The asterisk (*) tells it to return all the columns in the table.

In Google SQL – unless the data is coming from your own project – you prefix the table name with its project and dataset name. In the examples, I'm going to access data from the census_bureau_international dataset in the bigquery-public-data project. One of the tables I'll use is called midyear_population. I'll need to refer to this table as:

bigquery-public-data.census_bureau_international.midyear_population

For convenience, you can give this rather long name an alias so you don't have to keep typing it. In the example below, I'm giving it the alias midyear:

bigquery-public-data.census_bureau_international.midyear_population AS midyear

When you're testing on tables that may be very large, it's a good idea to put a limit on the number of rows that will be returned. Otherwise, you may exceed the limits of the free version of BigQuery. To see everything in the first 10 rows of this table, you'd type this command into the query window, then click RUN:

SELECT *
FROM bigquery-public-data.census_bureau_international.midyear_population AS midyear
LIMIT 10

Try it out for yourself. Unless you've made a typo, you should see the first 10 rows in the results window, and you can use the scroll bar to scroll through them.

Now let's look at a few of the optional features of the SELECT statement:

- The WHERE clause lets you choose only rows that meet certain criteria.
- Instead of showing all columns (i.e. using the asterisk), you can list the column names you want to see. In the results above, the column names (country_code, country_name, year, and midyear_population) are shown at the top of each column in the results window.
- You can specify the order of the rows in the result set using the ORDER BY clause.

Try running this query to see only country names and populations for 1975, sorted by country name:

SELECT midyear.country_name, midyear_population
FROM bigquery-public-data.census_bureau_international.midyear_population AS midyear
WHERE year = 1975
ORDER BY country_name

SQL has lots more features. Two of the most useful are:

- Aggregate functions, such as SUM() to find totals, AVG() to find averages, and MIN() and MAX() to find the smallest/largest values. Aggregates are usually used in conjunction with the GROUP BY clause, which defines the groups the aggregate functions apply to. Without grouping the data, you'll see the aggregates for the data set as a whole.
- Retrieving data from two or more tables, provided they have one or more columns in common that can be used to join them.

Let's try some sample queries. First:

SELECT midyear.country_name, AVG(midyear.midyear_population)
FROM bigquery-public-data.census_bureau_international.midyear_population AS midyear
GROUP BY country_name
ORDER BY country_name

This query shows the average midyear population for each country over the entire period.

The next query takes data from two tables: the midyear_population table and the birth_death_growth_rates table. You can then see the growth rate alongside the population. You'll need to join these on two matching columns: the country and the year. Here's the query, followed by its results:

SELECT midyear.country_name, midyear.midyear_population, growth.growth_rate
FROM bigquery-public-data.census_bureau_international.midyear_population AS midyear
JOIN bigquery-public-data.census_bureau_international.birth_death_growth_rates AS growth
  ON growth.country_name = midyear.country_name AND growth.year = midyear.year
WHERE midyear.year = 1975
ORDER BY country_name

For all these queries, BigQuery uses ANSI-standard syntax; the queries would work equally well in any SQL dialect.

Some Advanced SQL Commands

Going deeply into the syntax of complex SQL statements is beyond the scope of this article, but I'd like to include a couple of advanced queries that have the same syntax in Google SQL as they do in other dialects (such as MS SQL Server).

Window functions are used to view individual rows against aggregates from the entire dataset. This example ranks countries by population – the country with the highest population ranks as 1.

SELECT midyear.country_name, midyear.midyear_population,
  RANK() OVER (ORDER BY midyear_population DESC)
FROM bigquery-public-data.census_bureau_international.midyear_population AS midyear
WHERE year = 1975
ORDER BY country_name

Other advanced features include GROUP BY extensions, which allow you to show aggregates at more than one level. Again, BigQuery's syntax follows ANSI standards; if you've learned these features in MS SQL Server, you'll be able to use them in BigQuery. Here's an example that shows totals at the age and country levels:

SELECT m.country_name, m.age, SUM(m.population)
FROM bigquery-public-data.census_bureau_international.midyear_population_agespecific AS m
WHERE m.year = 1975
GROUP BY ROLLUP (m.country_name, m.age)
ORDER BY m.country_name, m.age

In the results, a null in the country_name or age column indicates that the row is a subtotal or total.

Additional Features of BigQuery

BigQuery has many features that are not found in most dialects. These make it particularly useful for dealing with big data coming from many different sources. It has several extra data types:

- Geography – Stores geographical coordinates and can be used to calculate the distance between two locations.
- Interval – Stores a time interval.
- JSON – A format used by many applications.
- Struct – A container to hold ordered fields.

BigQuery also allows you to work with arrays.
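As a quick taste of these features, the sketch below builds two GEOGRAPHY points and measures the distance between them, then totals an array literal. The coordinates and numbers are made-up illustrations rather than anything from the census dataset:

-- ST_GEOGPOINT(longitude, latitude) builds a GEOGRAPHY value;
-- ST_DISTANCE returns the distance between two points in meters.
-- UNNEST turns an array into rows so it can be aggregated.
SELECT
  ST_DISTANCE(ST_GEOGPOINT(-0.1276, 51.5072),   -- London
              ST_GEOGPOINT(2.3522, 48.8566))    -- Paris
    AS distance_in_meters,
  (SELECT SUM(x) FROM UNNEST([10, 20, 30]) AS x) AS array_total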
When working with very large datasets, it's particularly useful to be able to extract random samples of data. Here's an example of taking a random sample from the mortality_life_expectancy table:

SELECT AVG(m.life_expectancy)
FROM bigquery-public-data.census_bureau_international.mortality_life_expectancy AS m
TABLESAMPLE SYSTEM (10 PERCENT)

Instead of going through the entire table, which is large, this takes a statistical sample of 10% of the table and averages it. Since the sample is random, you might get slightly different results every time you run it.

You can learn more about the additional features of BigQuery SQL in the product documentation.

Working with Google BigQuery: The Next Step

The digital ocean grows larger every year, and more and more employers are looking for BigQuery skills. It's well worth becoming an expert in this area. But how?

Enhance your SQL skills by taking some courses. LearnSQL.com has a wide range of courses covering everything from beginner to advanced topics. You can search for specific courses or just browse what we have on offer. LearnSQL.com's practice track teaches you to be agile in solving all kinds of problems with SQL. You should also spend lots of time experimenting with the public datasets in Google Cloud Console. If you have your own business and would like to set up an SQL training program for your staff, we can help you do that, too.

BigQuery is here to stay, so start becoming an expert right now!