The goal of this course is to learn how to use Python and Spark to ingest, process, and analyze large volumes of data with different structures to generate insights and useful metrics from the data, walking through real-life examples and use cases.
- Use Python to handle large volumes of data as an analyst or developer, working closely with data scientists
- Create useful metrics and statistics from data that were previously out of reach
- Enable students to build their own analyses of large datasets, without relying on external or proprietary tools
Big Data is here to stay, as more and more companies see the value of storing the data they generate, internally or externally. But as with every new technology, adopting it is not enough if no value is extracted from it. Analyzing these datasets is a fundamental step toward unlocking the value held in the data. In this process, Python has become the most widely used programming language for processing and analyzing data, thanks to its ease of use, very rich ecosystem, and powerful libraries, and its adoption is still growing.
This course begins with an introduction to data manipulation in Python using Pandas, including the generation of statistics, metrics, and plots. The next step is to run the same analyses distributed across several computers, using Dask. Aggregating data for plotting when the full dataset does not fit into memory will also be addressed. For truly large problems and datasets, an introduction to Hadoop (HDFS and YARN) will be presented. The rest of the course focuses on Spark and its interaction with the tools presented earlier.
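The move from Pandas to Dask relies on the two libraries exposing very similar DataFrame interfaces. As a minimal sketch (the data and column names here are invented for illustration), the same group-by aggregation looks like this:

```python
import pandas as pd

# A tiny DataFrame standing in for a much larger dataset.
df = pd.DataFrame({"city": ["A", "B", "A", "B"],
                   "sales": [10, 20, 30, 40]})

# Mean sales per city, computed with pandas.
mean_sales = df.groupby("city")["sales"].mean()

# The Dask version is nearly identical -- only the construction of the
# DataFrame and the final .compute() call differ:
#   import dask.dataframe as dd
#   ddf = dd.from_pandas(df, npartitions=2)
#   mean_sales = ddf.groupby("city")["sales"].mean().compute()
```

This interface similarity is what lets the same analysis scale from a single machine to a cluster with few code changes.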
By the end of the course, students will be able to bootstrap their own Python environment, read large files and datasets bigger than available memory, connect to Hadoop systems and manipulate data stored there, and generate statistics, metrics, and graphs that represent the information in the dataset.
This approach differs from the more common treatments of Big Data problems, which usually rely on MapReduce or SQL-over-HDFS tools such as Hive or Impala. Instead, the course builds from the small, single-machine case up to the distributed one, exploiting the similar interfaces shared by the presented stack to make the final goal easier to understand and achieve.
What you will learn
- Read and transform data into different formats using Python
- Read large volumes of data on disk and manipulate it to generate basic statistics and metrics
- Handle distributed computing tasks across a cluster, or across local machines connected by a network
- Convert data from different sources to efficient formats for storage or querying, like Parquet
- Process, transform and aggregate data to generate clean datasets ready to be used in statistical analysis, visualization, and machine learning
- Explore data visually, enabling other analysts and decision makers to act on information extracted from data
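As a small preview of working with more data than fits into memory, here is a minimal sketch (the file contents and names are invented for illustration) of computing a statistic over a CSV read in chunks with pandas:

```python
import io
import pandas as pd

# An in-memory CSV standing in for a file too large to load at once.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

# Read the file in chunks of 100 rows, keeping only a running
# sum and row count instead of the full dataset.
total, count = 0, 0
for chunk in pd.read_csv(csv_data, chunksize=100):
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count  # overall mean, computed without loading all rows
```

The same accumulate-per-chunk pattern is what Dask and Spark automate across partitions and machines.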
Who This Book Is For
Readers are expected to know basic statistical measures (mean, median, standard deviation, and so on) and common chart types (line graph, scatter plot, and so on), and to have working knowledge of relational databases. Experience programming in Python or another language is required. Prior knowledge of distributed systems and/or Hadoop is helpful.