
How to Become a Big Data Engineer

What is a Big Data engineer, and how does their skill set differ from a data engineer’s? In this article, we explore the tools and platforms you’ll need to master as a Big Data engineer.

To move from being a regular data engineer to a Big Data engineer, you need to acquire several new skills and learn to use several new tools. The good news is that Big Data still lets you use your good old SQL skills to manipulate and get information from data repositories.

But first, why would you want to move from regular data engineering to Big Data engineering? I’ll explain using an example.

For transporting small groups of people over short distances and without much haste, you can manage by driving a bus. But if you need to transport many people over long distances and in minimum time, you will have to learn to fly an airplane. Sure, it will be more difficult. You will have more responsibilities, but it will give you more satisfaction and you will earn much more money.

The same difference exists between the work of a conventional data engineer and that of a Big Data engineer.

What Is Big Data?

As you might imagine, Big Data refers to huge data sets. The exact definition of “huge” may vary depending on who you ask, but it is normal for Big Data repositories to hold well over 10 terabytes of data. And it’s increasingly common to hear of volumes reaching the order of petabytes (1 petabyte = 1,024 terabytes).

But Big Data is not just about high volume. It also includes a wide variety of data (structured, semi-structured, and unstructured) and high processing and access speeds. These qualities are commonly referred to as “the three Vs”: volume, velocity, and variety.

Two more attributes are usually added to the three Vs above. “Veracity”, or the reliability of the data, is important for avoiding incomplete, dirty (i.e. full of errors), or inaccurate information. “Value” refers to the importance of extracting valuable insights that enable informed decisions and generate business opportunities.

The aforementioned particularities mean that a Big Data engineer must use special frameworks in addition to conventional data engineering tools like SQL. If you are an SQL beginner, you can get started by taking an online course on SQL queries; if you want to master the language, following a complete SQL learning track that will teach you everything you need is the way to go.

Later in this article, we will discuss the main Big Data technologies. For now, let’s answer another question: what’s the job outlook for Big Data engineers?

Big Data Engineers Are In Demand

The good news for Big Data engineers is that demand for Big Data skills keeps growing, and it far exceeds the supply of skilled workers. As a data engineer, you will probably be able to find a reasonably well-paying job, but Big Data positions command much higher salaries; you can bet that the term “Big Data” will be increasingly present in the future of every data engineer.

To give you an idea, Glassdoor indicates that (as of March 2024) the average base salary for a data engineer with a traditional database job in the United States was $144,000 per year. The average base salary for a Big Data engineer, also in the United States, was $157,000 per year. These figures are only averages. The annual base salary of a Big Data engineer can climb up to $197,000 – and if you are lucky enough to get a Big Data engineer position at Google or Apple, your base salary could be over $230,000 per year.

All indications are that Big Data salary trends will continue to rise and move away from the conventional data engineering salary level.

Benefits of Big Data

If you’re wondering why companies are willing to pay so much more for a Big Data engineer, the answer is that they also expect much more in return. Big Data is more than just large data sets – it is a tool that creates very high-value information, which can give companies a decisive advantage in their business or drive major breakthroughs toward their goals. To explain why, let’s look at a few examples:

  • Business: Big Data is an indispensable tool for understanding consumer behavior and anticipating market trends. The integration and correlation of different massive data sources – such as purchase details and support requests, credit reports, social media activity, and survey results – offers market insights that can only be obtained by collecting, processing, and analyzing massive amounts of information.
  • Healthcare: Big Data has become a vital tool for the healthcare industry. Real-time inpatient monitoring of sensor data and predictive risk analysis of discharged patients are just two examples of the many applications of Big Data in this area.
  • Government: Big Data is used for such things as identifying crime patterns, optimizing urban traffic, and managing environmental crises. It is also used to detect fraud in tax collection and refine taxpayer outreach programs.
  • Finance: Fraud detection is one of the main uses of Big Data in finance. Other uses include customer segmentation, cost optimization, and the generation of predictive financial models.
  • Mining, Oil & Gas: The intensive use of Big Data tools to process high volumes of seismic and microseismic information provides decisive advantages in the exploration and discovery of mineral and oil deposits.

What Does a Big Data Engineer Do?

A Big Data engineer is basically a software engineer who must also have a deep understanding of data engineering. Much of a Big Data engineer’s work involves designing and implementing software systems capable of collecting and processing gigantic volumes of data. Many of these systems involve Extract-Transform-Load (ETL) processes, which use sets of business rules to clean and organize the “raw” (unprocessed) data and prepare it for storage and use in analytics and machine learning (ML) processes.
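
To make this more concrete, below is a minimal sketch of the “transform and load” part of an ETL step, written in plain SQL. The table and column names (raw_orders, clean_orders) are hypothetical; in a real pipeline, logic like this would typically run inside a Big Data framework or a data warehouse rather than against a single database.

    -- Hypothetical ETL transform step: move raw, messy records into a curated table.
    -- All table and column names are illustrative only.
    INSERT INTO clean_orders (order_id, customer_id, order_date, amount_usd)
    SELECT
        CAST(order_id AS BIGINT),
        CUSTOMER_ID,
        CAST(order_date AS DATE),
        ROUND(amount_usd, 2)          -- normalize precision
    FROM raw_orders
    WHERE order_id IS NOT NULL        -- drop incomplete rows
      AND order_date IS NOT NULL
      AND amount_usd >= 0;            -- drop obviously dirty values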

Other tasks of a Big Data engineer include:

  • Designing architectures suitable for handling large volumes of data, aligned with business objectives.
  • Investigating new methods to improve data quality and security.
  • Creating data solutions based on Big Data ecosystems (see below) and their development and visualization tools.
  • Collaborating with data analysts, data scientists, and other professionals to provide access to and visualizations of the results of Big Data processes. These roles share some areas of responsibility, so it is worth comparing the work of a data analyst with that of a data engineer.

Skills and Tools Needed to Become a Big Data Engineer

A Big Data engineer should have a bachelor’s degree in a field related to information technology (IT) or data science. A master’s degree in Big Data systems or analytics can be of great help in getting higher-paying positions and more opportunities for career advancement. At the end of this article, I suggest some career paths to guide you on your road to becoming a Big Data engineer.

Beyond their degree, Big Data engineers must possess several essential skills. A thorough knowledge of algorithms, data structures, and certain programming languages is critical. So is a basic understanding of distributed systems.

To work with large volumes of data and provide efficient access to its insights, the Big Data engineer needs far more diverse skills and tools than a conventional data engineer. A conventional data engineer may well make a career out of knowing only SQL and managing the most popular database management systems.

(By the way, if you plan to land a job as an SQL programmer, be sure to get prepared for the SQL assessment test. You may want to keep our SQL basics cheat sheet handy when you don’t remember the details of a specific SQL command.)

Besides having SQL skills for Big Data, a Big Data engineer must know about NoSQL databases, structured and unstructured data, data warehouses (and their variants, such as data marts and data lakes), and something known as Big Data Frameworks. Let’s look at how each of these skills influences the daily work of a Big Data engineer.

Big Data, SQL, and Relational Databases

Structured Query Language (SQL) was born with relational databases and is intimately linked to them. Every professional with conventional data engineering certifications knows that relational databases are designed with the main purpose of storing structured information and prioritizing the preservation of data integrity in transaction processing. This makes them unsuitable when the priorities become scalability, access speed, and real-time streams – which is what happens when moving from conventional databases to Big Data repositories.

Does that mean that learning how to work with SQL databases will have been in vain when you become a Big Data engineer? Absolutely not. Big Data engineers will continue using SQL in data analysis for many years to come.

So the future of the SQL language is bright. It is so widespread that it has become a de facto standard for data management – big or small. The new technologies created especially for Big Data cannot ignore this fact. That’s why they all offer data access tools that allow you to view Big Data repositories as if they had a relational database structure. Below we will see some of the SQL-based technologies used in Big Data environments.

NoSQL Databases

NoSQL (meaning “not only SQL”) is a family of database technologies that aim to overcome the limitations of relational databases and support the volume, velocity, and variety of Big Data explained above. That’s why they are often preferable to relational databases for implementing Big Data solutions.

Although NoSQL databases vary in their implementation forms, they all have some shared characteristics:

  • Schema-less: NoSQL databases can store information without the need for the data structure to be predefined – unlike relational databases, where the schema (tables and their relationships) must be defined before they can be populated with information.
  • Scalability: Several NoSQL database architectures are designed with horizontal scalability as the main objective. This means that a NoSQL database can reside on a distributed file system (such as the Hadoop Distributed File System) that can grow in data volume simply by adding more nodes to it.
  • Real-time: Several NoSQL database implementations (e.g. Firebase, Redis, or DynamoDB) stand out for their high performance, scalability, and availability; this satisfies the basic needs of any real-time data application.

Despite their name, many NoSQL databases offer SQL-like query languages or SQL interfaces – reinforcing the idea that SQL is still relevant even when relational databases are not used.

Data Warehousing

Data warehouses emerged several decades ago as a way to collect information and centralize it for analytical processing. They have some similarities with Big Data: both technologies are designed to house large volumes of data and guarantee the veracity of the information. They also ensure that business value is obtained from these large volumes of information.

The difference between Big Data and data warehousing lies in the fact that data warehouses are designed to be built on relational schemas and fed with information coming from transactional systems (which are also based on relational databases). They are not designed to handle unstructured information, much less real-time data.

Although Big Data is a more modern and comprehensive technology than a data warehouse, the latter will not disappear or become obsolete. Both technologies are complementary and solve different use cases; if you need to perform analytical processing on structured data (e.g. sales or production information), a data warehouse is the most advisable solution. On the other hand, suppose you need to perform analytical processing on varied and unstructured information like emails, social network data, real-time application logs, or survey results. In that case, you should definitely aim for a Big Data solution.

There are also data warehouse technologies that operate on Big Data repositories, bridging the gap between the two technologies. One of the most popular is DBT, a data modeling/analytics tool that integrates with Cloud data providers and executes data transformation within the data warehouse.
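
As a rough illustration of that approach, a DBT model is essentially a SQL SELECT statement stored in a file; DBT materializes it as a table or view inside the warehouse. The sketch below assumes hypothetical model and source names (stg_orders, a “raw” source containing an orders table) that would have to be declared in the project’s configuration.

    -- models/stg_orders.sql – a hypothetical DBT model.
    -- DBT compiles this SELECT and materializes it as a table in the warehouse.
    {{ config(materialized='table') }}

    SELECT
        order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        amount
    FROM {{ source('raw', 'orders') }}   -- source must be declared in the project config
    WHERE order_id IS NOT NULL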

Big Data Platforms and Frameworks

In relational databases, there is a central entity called a relational database management system or RDBMS that resides on a server and manages the information stored in databases with predefined structures (schemas). The RDBMS provides mechanisms for querying and updating the information residing in the databases, mostly through SQL commands. All responsibilities for data storage and utilization fall on the monolithic RDBMS.

In Big Data, responsibilities are distributed among different entities that are responsible for data storage, processing, coordination, and exploitation. A little more than a decade ago, this concept was materialized by the Apache Software Foundation in an ecosystem called Hadoop.

The fundamental part of any Big Data ecosystem (and Hadoop in particular) is a file system capable of storing massive amounts of information. This file system cannot rely on a single physical storage unit. Instead, it uses multiple nodes capable of working in coordination to provide scalability, redundancy, and fault tolerance. In Hadoop, this file system is called HDFS (Hadoop Distributed File System).

Handling such massive amounts of information requires a programming model based on tasks capable of running in parallel, with execution distributed among multiple processing nodes. In Hadoop, this programming model is called MapReduce and is based on Java technology.

With so many storage and processing nodes, there is one piece that cannot be missing: a coordinator or orchestrator to maintain order in the Big Data ecosystem and ensure that each task has the resources it needs. In Hadoop, this piece is called YARN (Yet Another Resource Negotiator).

In any Big Data ecosystem, these three basic pieces – storage, processing, and coordination – are completed with tools that make it possible to exploit the data residing in the ecosystem. Many of these tools were designed to run on top of Hadoop, complementing the ecosystem and improving some of its shortcomings.

As a side note, it is worth mentioning that Hadoop is the most “veteran” Big Data platform; it has been surpassed in several respects by newer and more efficient tools. One of the main drawbacks of Hadoop that other technologies have tried to solve is its high complexity and cost to install, operate, tune, and scale.

How to Pilot a Big Data Platform

Let’s go back to the bus driver and airline pilot concept from the beginning of this article. If you’re a conventional data engineer, you’re probably used to starting each workday by opening your favorite SQL client, connecting to the databases you need to work with, and executing SQL commands. It’s almost like the bus driver turning the key to start the engine, opening the door for passengers to board, and transporting them to their destination.

But if you are a Big Data engineer, you’re at the helm of a gigantic data ecosystem. Data and processes are distributed across hundreds or thousands of nodes that must be carefully coordinated to deliver value to users. Think like an airline pilot: before you open the doors for passengers to board and begin their journey, you must ensure that several systems are fully operational and working in coordination. The lives of your passengers and your own depend on it.

 

Are you sure you want to go the way of the airline pilot?

In the Cockpit

If you’re still reading this, I imagine you answered yes to the previous question. Congratulations! Let’s look at the path to follow so that you can become the pilot of a Big Data machine.

In your cockpit, you will find a huge number and variety of tools designed for data exploitation in Big Data repositories. Let’s take just one of them: Hive. It’s a framework that allows you to easily manipulate large amounts of data with a query language called HQL (HiveQL), which is based on SQL. Under the hood, Hive converts HQL commands into MapReduce jobs so that they can be executed on a Hadoop cluster.

The Hive query language bears a lot of similarities to standard SQL. In addition to the SELECT command with all its clauses (WHERE, GROUP BY, ORDER BY, LIMIT, etc.), it supports DML commands (such as INSERT, UPDATE, and DELETE) and DDL commands (such as CREATE, ALTER, and DROP) to manage a pseudo-table schema.

When a command is executed in Hive, such as any SELECT ... FROM ..., Hive does not return the results immediately. Instead, it sends a MapReduce job to YARN. YARN makes sure the job has the necessary resources (processing, storage, memory) and queues it for execution. Hive waits until the job has been completed before sending the query results back to you. To you, it will look as if you had executed that SELECT in your favorite SQL client. But underneath, there was a whole gigantic machinery servicing that simple request.
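
To give you a feel for it, here is a small, hypothetical HiveQL session. The table and columns are made up; the point is that the syntax is essentially SQL, even though Hive turns each query into one or more distributed jobs behind the scenes.

    -- Hypothetical HiveQL example: define a table over files in HDFS, then query it.
    CREATE TABLE page_views (
      view_time TIMESTAMP,
      user_id   BIGINT,
      page_url  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Hive compiles this SELECT into MapReduce jobs and runs them on the cluster.
    SELECT page_url, COUNT(*) AS views
    FROM page_views
    GROUP BY page_url
    ORDER BY views DESC
    LIMIT 10;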

Big Data Tools and Platforms

We’ve said that Hadoop is an older platform and that it has been surpassed by other more modern and efficient ones. This does not mean that Hadoop is obsolete.

The good thing about Big Data is that its technologies were born in the open-source world, so the evolution of Big Data ecosystems is fast and constant. In addition to several large companies, there are communities of developers responsible for driving this evolution, building on existing solutions and constantly improving and complementing them.

Below are some of the tools and technologies that are emerging as the safest learning bets to gain a foothold in Big Data engineering.

Spark

Spark emerged in 2014 to address the performance limitations of MapReduce. Its main optimization was keeping data and intermediate results in cluster memory instead of writing them to disk between processing steps.

Spark supports several common languages (Python, Java, Scala, and R) and includes libraries for a variety of tasks, from SQL to streaming to machine learning. It can run on a laptop or on a cluster with thousands of servers. This makes it easy to get started with a small implementation and scale up to massive data processing across a wide range of applications.

Although Spark was designed to run on multiple cluster managers, historically it was primarily used with YARN and integrated into most Hadoop distributions. Over the years, there have been multiple major iterations of Spark. With the rise of Kubernetes as a popular scheduling mechanism, Spark has become a first-class citizen of Kubernetes and has recently removed its dependency on Hadoop.

For the user, Apache Spark exposes an ecosystem of components tailored to different use cases. The core component is Spark Core, the Spark platform execution engine that provides the infrastructure for in-memory computing as well as basic I/O, scheduling, monitoring, and fault management functions. Around Spark Core are components with more specific functions, such as Spark SQL, Spark Streaming, MLlib, SparkR, and GraphX.
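
For example, once your data is registered as a table or view, Spark SQL lets you query it with ordinary SQL while Spark distributes the work across the cluster. The query below is a sketch with made-up table names; you might run it through the spark-sql shell or pass it to spark.sql() from one of the supported languages.

    -- Hypothetical Spark SQL query; Spark plans and executes it across the cluster.
    SELECT c.country,
           SUM(o.amount) AS revenue,
           COUNT(*)      AS order_count
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.country
    ORDER BY revenue DESC;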

Flink

Apache Flink is a high-throughput, low-latency data processing engine that prioritizes in-memory computation, high availability, the elimination of single points of failure, and horizontal scalability. Flink provides algorithms and data structures to support both bounded and unbounded processing, all through a single programming interface. Applications that process unbounded data run continuously, while those that process bounded data terminate their execution when they consume all their input data.

Storm

Apache Storm facilitates reliable processing of unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Its main qualities are simplicity, the ability to be used with any programming language, and a developer-friendly approach to data manipulation.

Storm’s use cases include real-time analytics, online machine learning, continuous computing, distributed RPC (remote procedure calls), and ETL. It is among the fastest Big Data execution engines, surpassing 1 million tuples processed per second per node. Its other qualities include high scalability, fault tolerance, guaranteed data processing, and ease of configuration and use.

Cassandra

Apache Cassandra is a column-oriented NoSQL database specially designed for Big Data. Thanks to the use of wide-column storage, it is capable of handling large amounts of data through clusters of commodity servers, providing high availability with no single points of failure.

Cassandra employs a peer-to-peer architecture that facilitates data distribution, allowing it to scale horizontally and easily handle increasing amounts of data and traffic. In addition, it offers tunable consistency, which means clients can choose the exact level of consistency they need for each operation.
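
To illustrate, here is a short sketch in CQL (Cassandra Query Language, Cassandra’s SQL-like language). The keyspace, table, and values are hypothetical; note how the replication factor and the consistency level are explicit choices rather than fixed properties of the system.

    -- Hypothetical CQL session; all names and values are illustrative.
    CREATE KEYSPACE IF NOT EXISTS shop
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
      customer_id uuid,
      order_time  timestamp,
      amount      decimal,
      PRIMARY KEY (customer_id, order_time)
    ) WITH CLUSTERING ORDER BY (order_time DESC);

    -- In the cqlsh shell, the consistency level can be tuned per session:
    CONSISTENCY QUORUM;

    SELECT order_time, amount
    FROM shop.orders_by_customer
    WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;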

Pig

Apache Pig is a high-level platform used for creating MapReduce programs running on top of Hadoop. It uses a simple scripting language called Pig Latin. This language allows developers to write complex data processing tasks concisely and simply, abstracting them from the complexities of MapReduce and providing some similarities to SQL.

Developers can extend Pig Latin’s functionality with UDFs (user-defined functions) that can be written in other languages like Java, Python, JavaScript, or Ruby. The Pig engine translates Pig Latin scripts into a series of MapReduce tasks that can run on Hadoop clusters, allowing them to handle large amounts of data.

BigQuery

BigQuery is a petabyte-scale, low-cost, serverless data warehouse that is part of the Google Cloud Platform. It is a fully managed service, which means that its users don’t need to worry about storage, processing or network resources.

Since its launch in 2010, Google BigQuery has gained fans in organizations that need to analyze large amounts of information quickly and compare their results with publicly available statistical data. Today, many organizations demand BigQuery skills from their data job applicants.

An important part of BigQuery is its support for window functions, also called analytical functions or OVER functions; these have been part of the SQL standard since 2003. Learning how to use window functions in Google BigQuery is an important asset for a data analyst or anyone in a similar role.
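
For example, a window function lets you compare each row against an aggregate computed over a related group of rows without collapsing the result. The query below is a hypothetical example of the standard SQL that BigQuery accepts; the daily_sales table and its columns are made up.

    -- Hypothetical BigQuery query using a window (OVER) function:
    -- rank each day's sales within its month without collapsing the rows.
    SELECT
      sale_date,
      daily_total,
      RANK() OVER (
        PARTITION BY DATE_TRUNC(sale_date, MONTH)
        ORDER BY daily_total DESC
      ) AS rank_in_month
    FROM daily_sales;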

Your Next Steps to Becoming a Big Data Engineer

As we discussed earlier, most data engineers have at least a bachelor’s degree in an IT or data field. You can then pursue a master’s degree in Big Data, choosing one of the dozens available online. Hundreds of Big Data courses and certifications are also available, many of them provided directly by technology companies such as Google or IBM. Most importantly, many of them are free.

It is also a good idea to keep your SQL knowledge up to date, for which I recommend you take advantage of our All Forever SQL Package. It gives you access to all of LearnSQL.com’s current and future courses, covering SQL’s main dialects and offering thousands of interactive practice exercises.

Once you have basic knowledge of Big Data – even if you have not yet obtained enough diplomas and certifications to fill your resume – you can start accumulating experience by working on real-world Big Data projects. To do that, you’ll need large Big Data repositories, and that’s not something you can build on your own. Fortunately, there are plenty of free-to-use Big Data sets that you can draw on to put your knowledge into practice.

The world of Big Data is constantly evolving, so don’t think you can sit back and relax once you’ve accumulated enough degrees, certifications, and hours of practice. You’ll need to keep up to date by reading blogs, following Big Data influencers, and actively participating in communities of Big Data enthusiasts. Who knows – maybe you’ll become a data guru yourself who helps the world make better use of the gigantic amounts of information circulating on its networks!