pyspark on hpc
2022-06-23 22:50:00 【flavorfan】
Our local in-house cluster has limited resources, and a simple data-processing job took 3 days. The HPC system has plenty of compute, so, on the principle of eating from the shared pot before touching your own bowl, I decided to make full use of the shared resources first. After a quick survey, it turned out not to be very complicated.
1 Approach
Run Spark in local mode
Spark standalone mode involves multi-node communication and is fairly complex to set up. Instead, the data can be partitioned up front and processed by several parallel tasks, each running its own independent Spark local instance; this avoids building a cluster at all. Parallelism comes from requesting multiple nodes, multiple CPUs, and more memory.
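The partitioning idea above can be sketched as follows. This is a minimal sketch, assuming each parallel task is given its own index and the total task count (for example from a scheduler's job-array variables, which the original does not specify). The names `shard_for_task`, `local_master`, and `run_shard`, the app name, and the line-count job are all hypothetical placeholders.

```python
def shard_for_task(items, task_id, num_tasks):
    """Return the items owned by task `task_id` out of `num_tasks` (round-robin)."""
    if num_tasks < 1 or not 0 <= task_id < num_tasks:
        raise ValueError("task_id must satisfy 0 <= task_id < num_tasks")
    return list(items)[task_id::num_tasks]


def local_master(num_cores):
    """Build the Spark master URL for local mode with a fixed core count."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return f"local[{num_cores}]"


def run_shard(paths, num_cores, driver_memory="8g"):
    """Process one shard in its own Spark local instance (hypothetical job)."""
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(local_master(num_cores))
        .appName("shard-job")  # assumed name, not from the original
        .config("spark.driver.memory", driver_memory)
        .getOrCreate()
    )
    try:
        # Placeholder work: count the lines in this shard's text files.
        return spark.read.text(paths).count()
    finally:
        spark.stop()
```

Each task would call `run_shard(shard_for_task(all_paths, task_id, num_tasks), num_cores)` with its own index, so no Spark process ever needs to talk to another.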
Make pyspark discoverable from the Python environment
This is essentially done through environment variables. They can be set either in the Python code itself or in .bashrc or a shell script.
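As a minimal sketch of the in-Python variant, the helper below sets the environment variables and extends `sys.path`. It assumes the usual Spark layout, where `python/lib` under the Spark directory holds `pyspark.zip` and a `py4j-*-src.zip` archive. The function name `add_spark_to_path` is hypothetical; it globs for the py4j archive so that the version number does not have to be hard-coded.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home, python_exe=None):
    """Point this Python process at an unpacked Spark distribution.

    Returns the list of archive paths that are now on sys.path.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not py4j_zips:
        raise FileNotFoundError(f"no py4j-*-src.zip found in {lib_dir}")

    os.environ["SPARK_HOME"] = spark_home
    if python_exe is not None:
        # Interpreter that Spark should use for its Python workers.
        os.environ["PYSPARK_PYTHON"] = python_exe

    archives = [py4j_zips[-1], os.path.join(lib_dir, "pyspark.zip")]
    for path in archives:
        if path not in sys.path:
            sys.path.insert(0, path)
    return archives
```

After calling it, `import pyspark` should resolve against the unpacked distribution without any shell-level configuration.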
2 Steps
1) Install Spark (just unpack it)
Unpack spark-3.1.2-bin-hadoop3.2.tgz into the user directory, for example so that it ends up at /users/[username]/tools/spark
I used a symbolic link so that I can switch between versions later.
cd /users/[username]/tools/
tar -zxvf spark-3.1.2-bin-hadoop3.2.tgz
ln -s spark-3.1.2-bin-hadoop3.2 spark
2) Configure pyspark from within Python code
The following environment setup and test code ran successfully both as a .py file and in Jupyter.
import os
import sys
os.environ["PYSPARK_PYTHON"] = "/users/[username]/miniconda3/bin/python"
os.environ["SPARK_HOME"] = "/users/[username]/tools/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# Spark 3.1.2 ships py4j as py4j-0.10.9-src.zip
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
# test code
import random
from pyspark import SparkContext
sc = SparkContext(appName="myAppName")
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
.filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()
3) Configure pyspark through .bashrc or a script
Create myspark.sh:
#!/bin/sh
export SPARK_HOME='/users/[username]/tools/spark'
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON="/users/[username]/miniconda3/bin/python"
Put this in .bashrc (or source the script) and the Python-side configuration above is no longer needed; pyspark can then be used transparently.
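A quick way to confirm that the shell-level settings actually reached Python is to inspect the environment before importing pyspark. This is a minimal sketch; the function name `missing_spark_settings` is hypothetical, and the list of required variables is an assumption that simply mirrors the script above.

```python
import os

# Variables the shell configuration is expected to export (assumed list).
REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_PYTHON")


def missing_spark_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_spark_settings()` in a fresh session returns an empty list when the .bashrc entries have taken effect.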