pyspark on hpc
2022-06-23 22:50:00 【flavorfan】
Our local in-house cluster has limited resources, and a simple data-processing job took 3 days to run. The HPC has plenty of computing resources, so, in the spirit of eating from the pot before the bowl, I decided to make full use of the shared resources first. A quick survey showed that this is not very complicated.
1 Approach
Run Spark in local mode
Spark standalone mode involves multi-node communication and is complex to set up. Instead, the data can be split into shards and processed by several parallel tasks, with each shard handled by its own independent Spark local instance. This avoids building a cluster. It is achieved by requesting multiple nodes, multiple CPUs, and more memory.
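The sharding idea can be sketched as follows. This is a minimal illustration, not the author's code: the helper names `shard_for_task` and `local_master` are hypothetical, and reading `SLURM_CPUS_PER_TASK` assumes the HPC uses Slurm, which the post does not state.

```python
import os


def shard_for_task(paths, task_id, num_tasks):
    """Return the input paths that task `task_id` (0-based) should process.

    Paths are sorted first so that every task sees the same ordering,
    then dealt out round-robin across `num_tasks` independent tasks.
    """
    if num_tasks < 1 or not 0 <= task_id < num_tasks:
        raise ValueError("task_id must satisfy 0 <= task_id < num_tasks")
    return sorted(paths)[task_id::num_tasks]


def local_master(env=None, default_cores=1):
    """Build a Spark local-mode master string such as 'local[8]'.

    Assumption: the scheduler is Slurm and exports SLURM_CPUS_PER_TASK;
    if the variable is absent, fall back to `default_cores`.
    """
    env = os.environ if env is None else env
    cores = int(env.get("SLURM_CPUS_PER_TASK", default_cores))
    return "local[%d]" % cores
```

Each task would then create its own context with something like `SparkContext(master=local_master(), appName="shard-%d" % task_id)` and read only the paths returned by `shard_for_task`, so no task needs to communicate with another.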
Let the Python environment find pyspark
This is essentially done through environment variables. They can be set in one of two ways: in Python code, or in .bashrc or a shell script.
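As a rough sketch of the Python-side variant, the helper below puts the archives bundled with a Spark distribution onto `sys.path`. The function name `add_pyspark_to_path` is hypothetical, and the sketch assumes the usual layout of a Spark binary distribution, where `python/lib` holds `pyspark.zip` and a `py4j-*-src.zip` archive. Using a glob avoids hard-coding the py4j version.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Make the pyspark bundled under `spark_home` importable.

    Returns the list of archives that were added to sys.path.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    # The py4j version differs between Spark releases, so match by pattern.
    archives = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    archives.append(os.path.join(lib_dir, "pyspark.zip"))

    added = []
    for archive in archives:
        if os.path.exists(archive) and archive not in sys.path:
            sys.path.insert(0, archive)
            added.append(archive)

    os.environ["SPARK_HOME"] = spark_home
    return added
```

Calling `add_pyspark_to_path("/users/[username]/tools/spark")` before `import pyspark` would have the same effect as setting the paths by hand.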
2 Steps
1) Install Spark (it only needs to be unpacked)
Unpack spark-3.1.2-bin-hadoop3.2.tgz into your user directory and link it to a fixed path, for example /users/[username]/tools/spark
I used a symbolic link so that I can switch between versions later
cd /users/[username]/tools/
tar -zxvf spark-3.1.2-bin-hadoop3.2.tgz
ln -s spark-3.1.2-bin-hadoop3.2 spark
2) Configure pyspark in Python code
The environment setup and test code below were tested successfully both in a .py file and in Jupyter.
import os
import sys

# Python interpreter used by the Spark workers
os.environ["PYSPARK_PYTHON"] = "/users/[username]/miniconda3/bin/python"
os.environ["SPARK_HOME"] = "/users/[username]/tools/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# Spark 3.1.2 bundles py4j 0.10.9; adjust the file name for other versions
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

# test code: Monte Carlo estimate of pi
import random
from pyspark import SparkContext

sc = SparkContext(appName="myAppName")

def inside(p):
    # p is the sample index and is unused; draw a random point in the unit square
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()
3) Configure pyspark via .bashrc or a script
Create myspark.sh:
#!/bin/sh
export SPARK_HOME='/users/[username]/tools/spark'
# py4j is bundled with Spark and must be importable alongside pyspark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON="/users/[username]/miniconda3/bin/python"
Put these settings in .bashrc and the Python configuration above is no longer needed; pyspark can then be used transparently.
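Whether the shell settings actually took effect can be checked from Python before importing pyspark. The sketch below is only an illustration; the function name `spark_env_problems` is hypothetical, and it checks just the two variables that the import depends on.

```python
import os


def spark_env_problems(env=None):
    """Return a list of human-readable problems with the Spark environment.

    An empty list means SPARK_HOME points at a directory containing python/
    and that directory is listed in PYTHONPATH.
    """
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(os.path.join(spark_home, "python")):
        problems.append("SPARK_HOME has no python/ directory: " + spark_home)

    entries = env.get("PYTHONPATH", "").split(os.pathsep)
    if spark_home and os.path.join(spark_home, "python") not in entries:
        problems.append("PYTHONPATH does not include $SPARK_HOME/python")

    return problems
```

Running `print(spark_env_problems())` in a fresh shell or notebook gives a quick indication of which variable is missing.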