pylivy
Livy is an open source REST interface for interacting with Spark. pylivy is a Python client for Livy, enabling easy remote code execution on a Spark cluster.
Usage
The LivySession class is the main interface provided by pylivy:
from livy import LivySession

LIVY_URL = 'http://spark.example.com:8998'

with LivySession(LIVY_URL) as session:
    # Run some code on the remote cluster
    session.run("filtered = df.filter(df.name == 'Bob')")
    # Retrieve the result
    local_df = session.read('filtered')
Authenticate requests sent to Livy by passing any requests Auth object to the LivySession. For example, to perform HTTP basic auth do:
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('username', 'password')

with LivySession(LIVY_URL, auth) as session:
    session.run("filtered = df.filter(df.name == 'Bob')")
    local_df = session.read('filtered')
API Documentation
livy.session
class livy.session.LivySession(url, auth=None, verify=True, kind=<SessionKind.PYSPARK: 'pyspark'>, proxy_user=None, jars=None, py_files=None, files=None, driver_memory=None, driver_cores=None, executor_memory=None, executor_cores=None, num_executors=None, archives=None, queue=None, name=None, spark_conf=None, echo=True, check=True)
Manages a remote Livy session and high-level interactions with it.
The py_files, files, jars and archives arguments are lists of URLs, e.g. [“s3://bucket/object”, “hdfs://path/to/file”, …] and must be reachable by the Spark driver process. If the provided URL has no scheme, it’s considered to be relative to the default file system configured in the Livy server.
URLs in the py_files argument are copied to a temporary staging area and inserted into Python’s sys.path ahead of the standard library paths. This allows you to import .py, .zip and .egg files in Python.
URLs for jars, py_files, files and archives arguments are all copied to the same working directory on the Spark cluster.
The driver_memory and executor_memory arguments have the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g” or “t”) (e.g. 512m, 2g).
See https://spark.apache.org/docs/latest/configuration.html for more information on Spark configuration properties.
Parameters
- url (str) – The URL of the Livy server.
- auth (Union[AuthBase, Tuple[str, str], None]) – A requests-compatible auth object to use when making requests.
- verify (Union[bool, str]) – Either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. Defaults to True.
- kind (SessionKind) – The kind of session to create.
- proxy_user (Optional[str]) – User to impersonate when starting the session.
- jars (Optional[List[str]]) – URLs of jars to be used in this session.
- py_files (Optional[List[str]]) – URLs of Python files to be used in this session.
- files (Optional[List[str]]) – URLs of files to be used in this session.
- driver_memory (Optional[str]) – Amount of memory to use for the driver process (e.g. ‘512m’).
- driver_cores (Optional[int]) – Number of cores to use for the driver process.
- executor_memory (Optional[str]) – Amount of memory to use per executor process (e.g. ‘512m’).
- executor_cores (Optional[int]) – Number of cores to use for each executor.
- num_executors (Optional[int]) – Number of executors to launch for this session.
- archives (Optional[List[str]]) – URLs of archives to be used in this session.
- queue (Optional[str]) – The name of the YARN queue to which the session is submitted.
- name (Optional[str]) – The name of this session.
- spark_conf (Optional[Dict[str, Any]]) – Spark configuration properties.
- echo (bool) – Whether to echo output printed in the remote session. Defaults to True.
- check (bool) – Whether to raise an exception when a statement in the remote session fails. Defaults to True.
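The memory string format described above for driver_memory and executor_memory can be checked locally before a session is requested. The following is a minimal sketch and not part of pylivy: the helper names is_memory_string and session_options are hypothetical, and the pattern assumes exactly the format stated above (digits followed by one of k, m, g or t, accepted here in lower case only).

```python
import re
from typing import Any, Dict, Optional

# Assumed format: one or more digits followed by a single unit suffix.
_MEMORY_PATTERN = re.compile(r"[0-9]+[kmgt]")


def is_memory_string(value: str) -> bool:
    """Return True if value looks like '512m' or '2g'."""
    return _MEMORY_PATTERN.fullmatch(value) is not None


def session_options(
    driver_memory: Optional[str] = None,
    executor_memory: Optional[str] = None,
    spark_conf: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments for LivySession (hypothetical helper)."""
    options: Dict[str, Any] = {}
    for name, value in (
        ("driver_memory", driver_memory),
        ("executor_memory", executor_memory),
    ):
        if value is None:
            continue  # leave unset so the server-side default applies
        if not is_memory_string(value):
            raise ValueError(
                f"{name} must look like '512m' or '2g', got {value!r}"
            )
        options[name] = value
    if spark_conf is not None:
        options["spark_conf"] = dict(spark_conf)
    return options
```

The resulting dictionary could then be unpacked into the constructor, for example LivySession(LIVY_URL, **session_options(driver_memory='2g')).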
property state
The state of the managed Spark session.
Return type: SessionState
run(code)
Run some code in the managed Spark session.
Parameters
- code (str) – The code to run.
Return type: Output
livy.client
class livy.client.LivyClient(url, auth=None, verify=True)
A client for sending requests to a Livy server.
Parameters
- url (str) – The URL of the Livy server.
- auth (Union[AuthBase, Tuple[str, str], None]) – A requests-compatible auth object to use when making requests.
- verify (Union[bool, str]) – Either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use. Defaults to True.
legacy_server()
Determine if the server is running a legacy version.
Legacy versions support different session kinds than newer versions of Livy.
Return type: bool
create_session(kind, proxy_user=None, jars=None, py_files=None, files=None, driver_memory=None, driver_cores=None, executor_memory=None, executor_cores=None, num_executors=None, archives=None, queue=None, name=None, spark_conf=None)
Create a new session in Livy.
The py_files, files, jars and archives arguments are lists of URLs, e.g. [“s3://bucket/object”, “hdfs://path/to/file”, …] and must be reachable by the Spark driver process. If the provided URL has no scheme, it’s considered to be relative to the default file system configured in the Livy server.
URLs in the py_files argument are copied to a temporary staging area and inserted into Python’s sys.path ahead of the standard library paths. This allows you to import .py, .zip and .egg files in Python.
URLs for jars, py_files, files and archives arguments are all copied to the same working directory on the Spark cluster.
The driver_memory and executor_memory arguments have the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g” or “t”) (e.g. 512m, 2g).
See https://spark.apache.org/docs/latest/configuration.html for more information on Spark configuration properties.
Parameters
- kind (SessionKind) – The kind of session to create.
- proxy_user (Optional[str]) – User to impersonate when starting the session.
- jars (Optional[List[str]]) – URLs of jars to be used in this session.
- py_files (Optional[List[str]]) – URLs of Python files to be used in this session.
- files (Optional[List[str]]) – URLs of files to be used in this session.
- driver_memory (Optional[str]) – Amount of memory to use for the driver process (e.g. ‘512m’).
- driver_cores (Optional[int]) – Number of cores to use for the driver process.
- executor_memory (Optional[str]) – Amount of memory to use per executor process (e.g. ‘512m’).
- executor_cores (Optional[int]) – Number of cores to use for each executor.
- num_executors (Optional[int]) – Number of executors to launch for this session.
- archives (Optional[List[str]]) – URLs of archives to be used in this session.
- queue (Optional[str]) – The name of the YARN queue to which the session is submitted.
- name (Optional[str]) – The name of this session.
- spark_conf (Optional[Dict[str, Any]]) – Spark configuration properties.
Return type: Session
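As noted above, a URL without a scheme is treated as relative to the default file system configured in the Livy server. The sketch below shows one way to see, before creating a session, which entries of a py_files, files, jars or archives list would fall under that rule. It is not part of pylivy: split_by_scheme and the group names are hypothetical, and the check relies only on the standard library's urlsplit.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def split_by_scheme(urls: List[str]) -> Dict[str, List[str]]:
    """Group URLs into those with an explicit scheme and those without.

    Entries under "default_fs" have no scheme, so Livy would resolve them
    against its configured default file system.
    """
    groups: Dict[str, List[str]] = {"explicit": [], "default_fs": []}
    for url in urls:
        key = "explicit" if urlsplit(url).scheme else "default_fs"
        groups[key].append(url)
    return groups
```

A caller might log or reject the "default_fs" group when every dependency is expected to carry an explicit scheme such as s3:// or hdfs://.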
get_session(session_id)
Get information about a session.
Parameters
- session_id (int) – The ID of the session.
Return type: Optional[Session]
delete_session(session_id)
Kill a session.
Parameters
- session_id (int) – The ID of the session.
Return type: None
list_statements(session_id)
Get all the statements in a session.
Parameters
- session_id (int) – The ID of the session.
Return type: List[Statement]
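The three session methods above can be combined into a small cleanup routine. This is a sketch under stated assumptions, not code from pylivy: delete_if_exists, SessionClient and FakeClient are hypothetical names, the method signatures are copied from the reference above, and a None result from get_session is assumed to mean that the session does not exist. FakeClient is an in-memory stand-in so the example runs without a Livy server; in practice a LivyClient instance would be passed instead.

```python
from typing import Any, Dict, List, Optional, Protocol


class SessionClient(Protocol):
    """The subset of LivyClient methods used below (hypothetical protocol)."""

    def get_session(self, session_id: int) -> Optional[Any]: ...

    def list_statements(self, session_id: int) -> List[Any]: ...

    def delete_session(self, session_id: int) -> None: ...


def delete_if_exists(client: SessionClient, session_id: int) -> Optional[int]:
    """Kill a session if it exists.

    Returns the number of statements the session held, or None when
    get_session reports no such session (assumed meaning of None).
    """
    if client.get_session(session_id) is None:
        return None
    statement_count = len(client.list_statements(session_id))
    client.delete_session(session_id)
    return statement_count


class FakeClient:
    """In-memory stand-in for LivyClient, for illustration only."""

    def __init__(self) -> None:
        self.sessions: Dict[int, List[str]] = {1: ["stmt-a", "stmt-b"]}

    def get_session(self, session_id: int) -> Optional[Any]:
        if session_id in self.sessions:
            return {"id": session_id}
        return None

    def list_statements(self, session_id: int) -> List[Any]:
        return list(self.sessions[session_id])

    def delete_session(self, session_id: int) -> None:
        del self.sessions[session_id]
```

Checking get_session first keeps the helper safe to call repeatedly, since a second call for the same ID simply returns None.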
Contributing
Contributing to pylivy
Thanks for considering contributing to pylivy!
Asking questions and reporting issues
If you have any questions on using pylivy or would like to make a suggestion on improving pylivy, please open an issue on GitHub.
Submitting code changes
Before opening a PR, have a look at the information below on code formatting and tests. Tests will be run automatically on Travis and must pass before a PR can be merged.
Code formatting
Code must be formatted with Black (with a line length of 79, as configured in pyproject.toml), plus pass Flake8 linting and mypy static type checks.
It’s recommended that you configure your editor to autoformat your code with Black and to highlight any Flake8 or mypy errors. This will help you catch them early and avoid disappointment when the tests are run later!
Running tests
pylivy includes two types of tests: unit tests and integration tests. The unit tests test individual classes of the code base, while the integration tests verify the behaviour of the library against an actual running Livy server.
To run the unit tests, which run quickly and do not require a Livy server to be running, first install tox (a Python testing tool) if you do not already have it:
pip install tox
then run:
tox -e py37
tox will build the project into a package, prepare a Python virtual environment with additional test dependencies, and execute the tests. You can also run tests against Python 3.6 by replacing py37 with py36 in the above command.
To run integration tests, you need to first start a Livy server to test against. For this purpose, I’ve prepared a Docker image that runs a basic Livy setup. To run it:
docker run --publish 8998:8998 acroz/livy
Then, in a separate shell, run the integration tests:
tox -e py37-integration
Again, you can replace py37 with py36 to change the Python version used.
Adding tests
Any new contributions to the library should include appropriate tests, possibly including unit tests, integration tests, or both. Please get in touch by opening an issue if you’d like to discuss what makes sense.
Both unit tests and integration tests are written with the pytest testing framework. If you’re not familiar with it, I suggest having a look at their extensive documentation and examples first.