
A Gentle Introduction to PySpark

  • Bhumika Dutta
  • Dec 27, 2021

Today, Python is one of the most popular languages among coders and developers. If we are already familiar with Python and its various packages, PySpark is an excellent tool to learn for building more scalable analyses and pipelines.

 

It's a great tool for performing large-scale exploratory data analysis, building machine learning pipelines, and creating ETLs for a data platform.


 

PySpark

 

PySpark is a Python API for Apache Spark (which itself is written in Scala) that was released to enable collaboration between Apache Spark and Python. PySpark also allows us to work with Apache Spark's Resilient Distributed Datasets (RDDs) from Python.
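As a quick illustration, here is a minimal sketch of starting a PySpark session and working with an RDD; the application name and the local master URL are arbitrary choices for this example.

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession running locally on all available cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark-intro") \
    .getOrCreate()

# RDDs are created through the underlying SparkContext.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8, 10]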

 

Some features of PySpark are:

 

  • Computation in memory

  • Parallelized, distributed processing

  • It's compatible with a variety of cluster managers 

  • Fault-tolerant

  • Immutability (RDDs cannot be changed once created)

  • Lazy evaluation

  • Persistence and cache

  • When utilizing DataFrames, there is built-in optimization.

  • ANSI SQL is supported.

 

Py4J is a popular library that is bundled with PySpark and allows Python to interact dynamically with JVM objects. PySpark comes with several libraries that can help you write more efficient programs. There are also several other libraries that are compatible.

 

External libraries

 

According to Databricks, some of the external libraries that are compatible are:

 

  1. PySpark SQL:

 

PySparkSQL is a Python module for doing SQL-style analysis on large amounts of structured or semi-structured data. It lets us run SQL queries directly, and it can also be connected to Apache Hive, so HiveQL may be used as well.

 

PySparkSQL is a wrapper around the core of PySpark. PySparkSQL introduced the DataFrame, which is a tabular representation of structured data that looks like a table in a relational database management system.
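As a rough sketch of how this looks in practice, a DataFrame can be registered as a temporary view and queried with plain SQL; the view name, columns, and rows below are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pysparksql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL

spark.sql("SELECT name FROM people WHERE age > 40").show()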

 

  2. MLlib:

 

MLlib is Spark's machine learning (ML) library and is a wrapper around PySpark. To store and operate on data, this library uses the data parallelism approach.

 

The MLlib machine-learning API is straightforward to use. It provides a wide range of algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction, along with underlying optimization primitives.
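The sketch below shows the general shape of the DataFrame-based MLlib API, fitting a logistic regression on a tiny, made-up dataset; the column names and values are purely illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

data = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0)],
    ["f1", "f2", "label"],
)

# MLlib expects all features assembled into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(data)

model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()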

 

  3. GraphFrames:

 

GraphFrames is a graph processing toolkit that uses the PySpark core and PySparkSQL to provide a set of APIs for doing graph analysis quickly. It's designed for high-performance distributed computing.
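A minimal sketch of the GraphFrames API is shown below; it assumes the external graphframes package has been added to the cluster (for example via spark-submit --packages), since it does not ship with PySpark itself.

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # number of incoming edges per vertex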

 

 

PySpark Setup:

 

There are three different ways of setting up PySpark, and we can choose whichever is most convenient. They are:

 

  1. Self-hosted environments can be created by clustering our own computers, either bare metal or virtual machines. Apache Ambari can be a good example of this type of setup.

  2. We can also set up with the help of cloud providers. Spark clusters are available from most cloud providers: AWS provides EMR and Google Cloud Platform has Dataproc.

  3. Spark solutions are available from vendors such as Databricks and Cloudera, making it simple to get started with Spark.

 

 

Different cluster manager types supported by PySpark:

 

Spark By Examples lists four cluster managers that are supported by PySpark (a minimal sketch of selecting one via the master URL follows the list):

 

  1. Standalone is a simple cluster manager that comes with Spark and makes setting up a cluster a breeze.

 

  2. Apache Mesos is a cluster manager that can also run Hadoop MapReduce and PySpark applications.

 

  3. Hadoop YARN is the resource manager introduced in Hadoop 2 and is the most commonly used cluster manager.

 

  4. Kubernetes is an open-source framework for automating containerized application deployment, scaling, and administration.
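Whichever manager is used, a PySpark application selects it through the master URL. The sketch below runs locally; the other URLs in the comments are the standard formats, not the addresses of any particular cluster.

from pyspark.sql import SparkSession

# "local[*]"               -> run locally on all cores (no cluster manager)
# "spark://host:7077"      -> Spark standalone master
# "mesos://host:5050"      -> Apache Mesos master
# "yarn"                   -> Hadoop YARN (settings come from HADOOP_CONF_DIR)
# "k8s://https://host:443" -> Kubernetes API server
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("cluster-manager-demo") \
    .getOrCreate()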

 

(Must read: 5 Clustering Methods and Applications)

 

 

Defining Spark Dataframes:

 

The Spark data frame is the most important data type in PySpark. This object functions similarly to data frames in R and Pandas and can be thought of as a table distributed across a cluster. To do distributed computation with PySpark, we need to perform our operations on Spark data frames.

 

When working with Spark, we can also use Pandas data frames by calling toPandas() on a Spark data frame, which yields a pandas object. However, this method should be avoided except when working with tiny data frames, since it loads the entire object into memory on a single node.

 

The key difference between the two kinds of data frames is eager versus lazy execution: operations in PySpark are postponed until a result is actually required in the pipeline.
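The sketch below illustrates this: transformations only build up a plan, and nothing executes until an action such as count() or toPandas() is called. The column name and sizes are arbitrary.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))
filtered = df.filter(F.col("squared") > 100)   # lazy: nothing runs yet

print(filtered.count())                        # action: the plan executes here

# toPandas() is also an action and pulls everything to the driver,
# so reserve it for small, already-aggregated results.
small_pdf = filtered.limit(10).toPandas()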

 

Pandas UDFs:

 

Pandas user-defined functions (UDFs) are one of Spark's features, allowing you to execute distributed computing with Pandas data frames in a Spark context.

 

In general, these UDFs work by partitioning a Spark data frame using a groupby statement; each partition is sent to a worker node and converted into a Pandas data frame, which is then passed to the UDF.

 

The UDF then outputs a converted Pandas data frame that is concatenated with the other partitions before being translated back to a Spark data frame.
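A hedged sketch of this grouped-map pattern, using the Spark 3 applyInPandas API, is shown below; the column names and the mean-centering logic are illustrative only.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

df = spark.createDataFrame([("a", 1.0), ("a", 3.0), ("b", 5.0)], ["key", "value"])

def center(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives here as an ordinary Pandas data frame on a worker.
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# Spark splits the data by key, runs the function on each group's Pandas
# frame, and reassembles the results into a single Spark data frame.
df.groupBy("key").applyInPandas(center, schema="key string, value double").show()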

 

(Similar reading: A Beginner’s Tutorial on PyTorch)

 

 

Working of PySpark:

 

  1. First, we have to load a dataset into a data frame. We may apply transformations, do analysis and modeling, generate visualizations, and save the results after the data has been put into a data frame.

  2. Loading a CSV file with PySpark is a little more involved. Because there is no local storage in a distributed environment, the file's path must point to a distributed file system such as HDFS, the Databricks File Store (DBFS), or S3.

  3. The following step is to load the CSV file into a Spark data frame, as sketched in the code after this list. To read a CSV file, we have to supply the file's path and a number of parameters to the read function.

  4. When reading CSV files into data frames, Spark uses an eager approach, which means that all of the data is put into memory before the next step is executed, but when reading parquet files, a lazy approach is utilized.

  5. When dealing with Spark, it's best to avoid eager operations. If we need to handle a large number of CSV files, we must first convert the data to parquet format before proceeding with the rest of the pipeline.

  6. It's not a good idea to write data to local storage while using PySpark, just as it's not a good idea to read data from local storage. We should instead use a distributed file system like S3 or HDFS.

  7. If we need the results in a CSV file, we will need to use a slightly different output step. The fact that all of the data will be pulled to a single node before being written to CSV is one of the primary distinctions in this technique.
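The sketch below ties these steps together, assuming the data lives at a distributed path such as an S3 bucket; the paths, options, and column handling are placeholders rather than a specific pipeline.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-pipeline-demo").getOrCreate()

# Steps 2-3: read the CSV from distributed storage, supplying read parameters.
df = (spark.read
      .option("header", "true")       # first line holds the column names
      .option("inferSchema", "true")  # let Spark guess the column types
      .csv("s3a://example-bucket/input/data.csv"))

# Steps 4-5: persist the data as parquet so later stages can read it lazily.
df.write.mode("overwrite").parquet("s3a://example-bucket/staging/data.parquet")
parquet_df = spark.read.parquet("s3a://example-bucket/staging/data.parquet")

# Steps 6-7: write results back to distributed storage; coalesce(1) pulls all
# data onto a single task, so only use it when the output is small.
parquet_df.coalesce(1).write.mode("overwrite").csv("s3a://example-bucket/output/")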

 

 

Advantages of PySpark:

 

  1. PySpark is a general-purpose, in-memory, distributed processing engine that enables you to effectively process data in a distributed manner.

  2. For in-memory workloads, PySpark applications can run up to 100 times faster than traditional Hadoop MapReduce jobs.

  3. When it comes to data ingestion pipelines, PySpark has a lot of advantages.

  4. We can use PySpark to handle data from Hadoop HDFS, AWS S3, and a variety of other file systems.

  5. PySpark can also be used to process real-time data using Kafka and Spark Streaming.

  6. PySpark Streaming allows you to stream files from the file system as well as data from a socket (see the sketch after this list).

  7. Machine learning and graph libraries are built-in to PySpark.
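As a small illustration of the streaming point above, the sketch below reads a text stream from a socket with Structured Streaming; the host and port are placeholders (for example, a stream fed locally by nc -lk 9999).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Print each micro-batch to the console until the query is stopped.
query = lines.writeStream.outputMode("append").format("console").start()
query.awaitTermination()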

 

 

Things to keep in mind:

 

Although this is an introductory blog on PySpark, Towards Data Science offers some important advice that is essential for people who are new to PySpark:

 

  • Because dictionaries are plain Python objects, code that indexes into them with keys may not execute in a distributed environment. Instead of using keys to look up entries in a dictionary, consider adding another column to the data frame that can be used as a filter (see the sketch after this list).

  • All data is loaded into memory on the driver node when toPandas() is called, which prohibits operations from being conducted in a distributed manner. When data has previously been aggregated and we wish to utilize conventional Python charting tools, this method is appropriate, but it should not be used for huge data frames.

  • It's best to avoid eager operations that draw whole data frames into memory if we want our pipeline to be as scalable as feasible.  Reading CSVs is an eager process, therefore to design more scalable pipelines, we can store the data frame as parquet and then reload it from parquet.

  • To allow parallelized code execution, it's preferable to rewrite for-loop logic using the groupby-apply paradigm. Focusing on this pattern in Python also leads to cleaner code that is easier to port to PySpark.
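As a sketch of the first tip in the list, a small lookup table can be joined onto the data frame instead of indexing a driver-side Python dictionary; the country codes and names below are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lookup-demo").getOrCreate()

orders = spark.createDataFrame([(1, "US"), (2, "IN")], ["order_id", "country"])
lookup = spark.createDataFrame(
    [("US", "United States"), ("IN", "India")], ["country", "country_name"]
)

# The extra column arrives via a (broadcast) join, so the mapping runs in a
# distributed manner instead of on the driver.
orders.join(F.broadcast(lookup), on="country", how="left").show()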

 

To conclude, PySpark is a wonderful tool for data scientists to learn, since it enables scalable analyses and machine learning pipelines. This article is aimed at beginners as well as those who are already familiar with Apache Spark, Apache Hadoop, Scala, the Hadoop Distributed File System (HDFS), and Python.

 

(Suggested read: An Essential Guide On PyCharm)
