TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

Last update: Jan 04, 2023

Overview

TensorFlowOnSpark

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters.

By combining salient features from the TensorFlow deep learning framework with Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

It enables both distributed TensorFlow training and inferencing on Spark clusters, with a goal to minimize the amount of code changes required to run existing TensorFlow programs on a shared grid. Its Spark-compatible API helps manage the TensorFlow cluster with the following steps:

Startup - launches the Tensorflow main function on the executors, along with listeners for data/control messages.
Data ingestion
- InputMode.TENSORFLOW - leverages TensorFlow's built-in APIs to read data files directly from HDFS.
- InputMode.SPARK - sends Spark RDD data to the TensorFlow nodes via a TFNode.DataFeed class. Note that we leverage the Hadoop Input/Output Format to access TFRecords on HDFS.
Shutdown - shuts down the Tensorflow workers and PS nodes on the executors.

Background
Install
Usage
API
Contribute
License

Background

TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.

TensorFlowOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

Easily migrate existing TensorFlow programs with <10 lines of code change.
Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard.
Server-to-server direct communication achieves faster learning when available.
Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow.
Easily integrate with your existing Spark data processing pipelines.
Easily deployed on cloud or on-premise and on CPUs or GPUs.

Install

TensorFlowOnSpark is provided as a pip package, which can be installed on single machines via:

# for tensorflow>=2.0.0
pip install tensorflowonspark

# for tensorflow<2.0.0
pip install tensorflowonspark==1.4.4

For distributed clusters, please see our wiki site for detailed documentation for specific environments, such as our getting started guides for single-node Spark Standalone, YARN clusters and AWS EC2. Note: the Windows operating system is not currently supported due to this issue.

Usage

To use TensorFlowOnSpark with an existing TensorFlow application, you can follow our Conversion Guide to describe the required changes. Additionally, our wiki site has pointers to some presentations which provide an overview of the platform.

Note: since TensorFlow 2.x breaks API compatibility with TensorFlow 1.x, the examples have been updated accordingly. If you are using TensorFlow 1.x, you will need to checkout the v1.4.4 tag for compatible examples and instructions.

API

API Documentation is automatically generated from the code.

Contribute

Please join the TensorFlowOnSpark user group for discussions and questions. If you have a question, please review our FAQ before posting.

Contributions are always welcome. For more information, please see our guide for getting involved.

License

The use and distribution terms for this software are covered by the Apache 2.0 license. See LICENSE file for terms.

Comments

Bump protobuf-java from 3.16.1 to 3.16.3
Bumps protobuf-java from 3.16.1 to 3.16.3.

Release notes

Sourced from protobuf-java's releases.

Protobuf Release v3.16.3

Java

Refactoring java full runtime to reuse sub-message builders and prepare to migrate parsing logic from parse constructor to builder.

Move proto wireformat parsing functionality from the private "parsing constructor" to the Builder class.

Change the Lite runtime to prefer merging from the wireformat into mutable messages rather than building up a new immutable object before merging. This way results in fewer allocations and copy operations.

Make message-type extensions merge from wire-format instead of building up instances and merging afterwards. This has much better performance.

Fix TextFormat parser to build up recurring (but supposedly not repeated) sub-messages directly from text rather than building a new sub-message and merging the fully formed message into the existing field.

This release addresses a Security Advisory for Java users

Commits

b8c2488 Updating version.json and repo version numbers to: 16.3

42e47e5 Refactoring Java parsing (3.16.x) (#10668)

98884a8 Merge pull request #10556 from deannagarcia/3.16.x

450b648 Cherrypick ruby fixes for monterey

b17bb39 Merge pull request #10548 from protocolbuffers/3.16.x-202209131829

c18f5e7 Updating changelog

6f4e817 Updating version.json and repo version numbers to: 16.2

a7d4e94 Merge pull request #10547 from deannagarcia/3.16.x

55815e4 Apply patch

152d7bf Update version.json with "lts": true (#10535)

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 0
Evalator hangs while training
Environment:

Python version 3.7

Spark version 2.4

TensorFlow version 2.5

TensorFlowOnSpark version 2.2.3

Cluster version hadoop

Describe the bug: I found the evaluator node won't work any more after sometime while training nodes work fine and the whole cluster doesn't crash. The total training step is 80000 and the evaluator only evaluates for 10000+ step. After that no more logs are output.
opened by jiqiujia 1
do we support scala & java code write tensorflow model with tenorflow-core-api ?
Environment:

Python version [e.g. 2.7, 3.6]

Spark version [e.g. 2.1, 2.3.1]

TensorFlow version [e.g. 1.5, 1.9.0]

TensorFlowOnSpark version [e.g. 1.1, 1.3.2]

Cluster version [e.g. Standalone, Hadoop 2.8, CDH5]

Describe the bug: A clear and concise description of what the bug is.

Logs: If applicable, add logs to help explain your problem. Note: errors may not be fully described in the driver/console logs. Make sure to check the executor logs for possible root causes.

Spark Submit Command Line: If applicable, add your spark-submit command line.
opened by mullerhai 3
Get stuck at "Added broadcast_0_piece0 in memory on" while runing Spark standalone cluster
Environment:

Python version 3.6.9

Spark version 3.0.0

TensorFlow version 2.6.2

TensorFlowOnSpark version 2.2.4

Cluster version Standalone, Hadoop 3.1.3

Describe the bug: Get stuck at "INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on" while runing Spark standalone cluster while Training MNIST using Keras.

Logs: 2021-12-27 10:51:01,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2021-12-27 10:51:03,012 INFO spark.SparkContext: Running Spark version 3.0.0 2021-12-27 10:51:03,044 INFO resource.ResourceUtils: ============================================================== 2021-12-27 10:51:03,045 INFO resource.ResourceUtils: Resources for spark.driver:

2021-12-27 10:51:03,045 INFO resource.ResourceUtils: ============================================================== 2021-12-27 10:51:03,045 INFO spark.SparkContext: Submitted application: mnist_keras 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls to: amax 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls to: amax 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls groups to: 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls groups to: 2021-12-27 10:51:03,081 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set() 2021-12-27 10:51:03,232 INFO util.Utils: Successfully started service 'sparkDriver' on port 45175. 2021-12-27 10:51:03,255 INFO spark.SparkEnv: Registering MapOutputTracker 2021-12-27 10:51:03,275 INFO spark.SparkEnv: Registering BlockManagerMaster 2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 2021-12-27 10:51:03,291 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat 2021-12-27 10:51:03,298 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1a9cd977-b478-4add-bb24-887f1cb3e057 2021-12-27 10:51:03,311 INFO memory.MemoryStore: MemoryStore started with capacity 14.2 GiB 2021-12-27 10:51:03,320 INFO spark.SparkEnv: Registering OutputCommitCoordinator 2021-12-27 10:51:03,401 INFO util.log: Logging initialized @2665ms to org.sparkproject.jetty.util.log.Slf4jLog 2021-12-27 10:51:03,443 INFO server.Server: jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_212-b10 2021-12-27 10:51:03,457 INFO server.Server: Started @2722ms 2021-12-27 10:51:03,474 INFO server.AbstractConnector: Started [email protected]{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 2021-12-27 10:51:03,475 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 2021-12-27 10:51:03,495 INFO handler.ContextHandler: Started [email protected]{/jobs,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,496 INFO handler.ContextHandler: Started [email protected]{/jobs/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,497 INFO handler.ContextHandler: Started [email protected]{/jobs/job,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,498 INFO handler.ContextHandler: Started [email protected]{/jobs/job/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started [email protected]{/stages,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started [email protected]{/stages/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,500 INFO handler.ContextHandler: Started [email protected]{/stages/stage,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started [email protected]{/stages/stage/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started [email protected]{/stages/pool,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started [email protected]{/stages/pool/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started [email protected]{/storage,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started [email protected]{/storage/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started [email protected]{/storage/rdd,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started [email protected]{/storage/rdd/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started [email protected]{/environment,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started [email protected]{/environment/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started [email protected]{/executors,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started [email protected]{/executors/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started [email protected]{/executors/threadDump,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,506 INFO handler.ContextHandler: Started o.s.j.s.Serv[email protected]{/executors/threadDump/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started [email protected]{/static,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started [email protected]{/,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started [email protected]{/api,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started [email protected]{/jobs/job/kill,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,512 INFO handler.ContextHandler: Started [email protected]{/stages/stage/kill,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,513 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://amax11:4040 2021-12-27 10:51:03,654 INFO client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.11:7077... 2021-12-27 10:51:03,711 INFO client.TransportClientFactory: Successfully created connection to /192.168.1.11:7077 after 36 ms (0 ms spent in bootstraps) 2021-12-27 10:51:03,793 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20211227105103-0014 2021-12-27 10:51:03,794 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/0 on worker-20211226140545-192.168.1.11-43273 (192.168.1.11:43273) with 1 core(s) 2021-12-27 10:51:03,796 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/0 on hostPort 192.168.1.11:43273 with 1 core(s), 27.0 GiB RAM 2021-12-27 10:51:03,796 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/1 on worker-20211226140545-192.168.1.7-36031 (192.168.1.7:36031) with 1 core(s) 2021-12-27 10:51:03,797 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/1 on hostPort 192.168.1.7:36031 with 1 core(s), 27.0 GiB RAM 2021-12-27 10:51:03,797 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/2 on worker-20211226140545-192.168.1.5-36787 (192.168.1.5:36787) with 1 core(s) 2021-12-27 10:51:03,798 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/2 on hostPort 192.168.1.5:36787 with 1 core(s), 27.0 GiB RAM 2021-12-27 10:51:03,801 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42229. 2021-12-27 10:51:03,801 INFO netty.NettyBlockTransferService: Server created on amax11:42229 2021-12-27 10:51:03,803 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 2021-12-27 10:51:03,812 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,815 INFO storage.BlockManagerMasterEndpoint: Registering block manager amax11:42229 with 14.2 GiB RAM, BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,818 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,819 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/0 is now RUNNING 2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/2 is now RUNNING 2021-12-27 10:51:03,823 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/1 is now RUNNING 2021-12-27 10:51:03,934 INFO handler.ContextHandler: Started [email protected]{/metrics/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:04,320 INFO history.SingleEventLogFileWriter: Logging events to hdfs://amax11:8020/spark-sa-history/app-20211227105103-0014.inprogress 2021-12-27 10:51:04,521 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 args: Namespace(batch_size=64, cluster_size=3, epochs=3, export_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_export', images_labels='/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train', model_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_model', tensorboard=False) 2021-12-27 10:51:04,712 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KiB, free 14.2 GiB) 2021-12-27 10:51:04,759 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB) 2021-12-27 10:51:04,762 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on amax11:42229 (size: 27.4 KiB, free: 14.2 GiB) 2021-12-27 10:51:04,765 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:0 2021-12-27 10:51:04,916 INFO (MainThread-11873) Reserving TFSparkNodes 2021-12-27 10:51:04,917 INFO (MainThread-11873) cluster_template: {'chief': [0], 'worker': [1, 2]} 2021-12-27 10:51:04,919 INFO (MainThread-11873) Reservation server binding to port 0 2021-12-27 10:51:04,919 INFO (MainThread-11873) listening for reservations at ('192.168.1.11', 37823) 2021-12-27 10:51:04,919 INFO (MainThread-11873) Starting TensorFlow on executors 2021-12-27 10:51:04,924 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 27648, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 2021-12-27 10:51:04,928 INFO (MainThread-11873) Waiting for TFSparkNodes to start 2021-12-27 10:51:04,928 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:04,983 INFO spark.SparkContext: Starting job: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327 2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Got job 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) with 3 output partitions 2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) 2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Parents of final stage: List() 2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Missing parents: List() 2021-12-27 10:51:05,003 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327), which has no missing parents 2021-12-27 10:51:05,012 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB) 2021-12-27 10:51:05,014 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB) 2021-12-27 10:51:05,015 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on amax11:42229 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,015 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1200 2021-12-27 10:51:05,025 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) (first 15 tasks are for partitions Vector(0, 1, 2)) 2021-12-27 10:51:05,026 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks 2021-12-27 10:51:05,245 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.5:52680) with ID 2 2021-12-27 10:51:05,279 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.7:40504) with ID 1 2021-12-27 10:51:05,286 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.11:44508) with ID 0 2021-12-27 10:51:05,343 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.5:33393 with 14.2 GiB RAM, BlockManagerId(2, 192.168.1.5, 33393, None) 2021-12-27 10:51:05,361 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.7:38801 with 14.2 GiB RAM, BlockManagerId(1, 192.168.1.7, 38801, None) 2021-12-27 10:51:05,363 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:35421 with 14.2 GiB RAM, BlockManagerId(0, 192.168.1.11, 35421, None) 2021-12-27 10:51:05,391 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.5, executor 2, partition 0, PROCESS_LOCAL, 7337 bytes) 2021-12-27 10:51:05,398 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.7, executor 1, partition 1, PROCESS_LOCAL, 7337 bytes) 2021-12-27 10:51:05,399 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.11, executor 0, partition 2, PROCESS_LOCAL, 7337 bytes) 2021-12-27 10:51:05,559 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.7:38801 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,560 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.11:35421 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,566 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.5:33393 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,930 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:06,931 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:07,932 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:08,934 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:09,935 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:10,936 INFO (MainThread-11873) all reservations completed 2021-12-27 10:51:10,937 INFO (MainThread-11873) All TFSparkNodes started 2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'} 2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'} 2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'} 2021-12-27 10:51:10,937 INFO (MainThread-11873) Feeding training data 2021-12-27 10:51:11,023 INFO mapred.FileInputFormat: Total input files to process : 10 2021-12-27 10:51:11,061 INFO spark.SparkContext: Starting job: collect at PythonRDD.scala:168 2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Got job 1 (collect at PythonRDD.scala:168) with 30 output partitions 2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at PythonRDD.scala:168) 2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Parents of final stage: List() 2021-12-27 10:51:11,065 INFO scheduler.DAGScheduler: Missing parents: List() 2021-12-27 10:51:11,066 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53), which has no missing parents 2021-12-27 10:51:11,078 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB) 2021-12-27 10:51:11,079 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB) 2021-12-27 10:51:11,080 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on amax11:42229 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:11,080 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1200 2021-12-27 10:51:11,081 INFO scheduler.DAGScheduler: Submitting 30 missing tasks from ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 2021-12-27 10:51:11,081 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 30 tasks 2021-12-27 10:51:15,330 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, 192.168.1.5, executor 2, partition 0, ANY, 7536 bytes) 2021-12-27 10:51:15,333 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 9956 ms on 192.168.1.5 (executor 2) (1/3) 2021-12-27 10:51:15,341 INFO python.PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 45655 2021-12-27 10:51:15,361 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.5:33393 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,387 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.5:33393 (size: 27.4 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, 192.168.1.11, executor 0, partition 1, ANY, 7536 bytes) 2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 10251 ms on 192.168.1.11 (executor 0) (2/3) 2021-12-27 10:51:15,691 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.11:35421 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, 192.168.1.7, executor 1, partition 2, ANY, 7536 bytes) 2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10314 ms on 192.168.1.7 (executor 1) (3/3) 2021-12-27 10:51:15,714 INFO scheduler.DAGScheduler: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) finished in 10.703 s 2021-12-27 10:51:15,714 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 2021-12-27 10:51:15,715 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.11:35421 (size: 27.4 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,719 INFO scheduler.DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job 2021-12-27 10:51:15,719 INFO scheduler.TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished 2021-12-27 10:51:15,721 INFO scheduler.DAGScheduler: Job 0 finished: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327, took 10.737965 s 2021-12-27 10:51:15,749 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.7:38801 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,779 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.7:38801 (size: 27.4 KiB, free: 14.2 GiB)

The logs from Spark UI: Spark Executor Command: "/opt/module/jdk1.8.0_212/bin/java" "-cp" "/opt/module/spark-standalone/conf/:/opt/module/spark-standalone/jars/*" "-Xmx27648M" "-Dspark.driver.port=45175" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:45175" "--executor-id" "0" "--hostname" "192.168.1.11" "--cores" "1" "--app-id" "app-20211227105103-0014" "--worker-url" "spark://[email protected]:43273"

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 21/12/27 10:51:04 INFO CoarseGrainedExecutorBackend: Started daemon with process name: [email protected] 21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for TERM 21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for HUP 21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for INT 21/12/27 10:51:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to: 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to: 21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set() 21/12/27 10:51:04 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 54 ms (0 ms spent in bootstraps) 21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to: 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to: 21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set() 21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 1 ms (0 ms spent in bootstraps) 21/12/27 10:51:05 INFO DiskBlockManager: Created local directory at /tmp/spark-66944227-f0bf-4829-b870-f5d85feae2cd/executor-2c04267d-7763-4e33-b90c-a6c1ead9f50b/blockmgr-6ac139bf-9f9c-496c-82de-a559a0202fad 21/12/27 10:51:05 INFO MemoryStore: MemoryStore started with capacity 14.2 GiB 21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:45175 21/12/27 10:51:05 INFO WorkerWatcher: Connecting to worker spark://[email protected]:43273 21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to /192.168.1.11:43273 after 1 ms (0 ms spent in bootstraps) 21/12/27 10:51:05 INFO WorkerWatcher: Successfully connected to spark://[email protected]:43273 21/12/27 10:51:05 INFO ResourceUtils: ============================================================== 21/12/27 10:51:05 INFO ResourceUtils: Resources for spark.executor:

21/12/27 10:51:05 INFO ResourceUtils: ============================================================== 21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Successfully registered with driver 21/12/27 10:51:05 INFO Executor: Starting executor ID 0 on host 192.168.1.11 21/12/27 10:51:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35421. 21/12/27 10:51:05 INFO NettyBlockTransferService: Server created on 192.168.1.11:35421 21/12/27 10:51:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 21/12/27 10:51:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.168.1.11, 35421, None) 21/12/27 10:51:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.168.1.11, 35421, None) 21/12/27 10:51:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.168.1.11, 35421, None) 21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Got assigned task 2 21/12/27 10:51:05 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) 21/12/27 10:51:05 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB) 21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:42229 after 2 ms (0 ms spent in bootstraps) 21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB) 21/12/27 10:51:05 INFO TorrentBroadcast: Reading broadcast variable 1 took 90 ms 21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB) 2021-12-27 10:51:10,558 INFO (MainThread-12074) Available GPUs: ['0'] 2021-12-27 10:51:10,559 INFO (MainThread-12074) Proposed GPUs: ['0'] 2021-12-27 10:51:10,559 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0 2021-12-27 10:51:10,578 INFO (MainThread-12074) connected to server at ('192.168.1.11', 37823) 2021-12-27 10:51:10,578 INFO (MainThread-12074) TFSparkNode.reserve: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) export TF_CONFIG: {"cluster": {"chief": ["192.168.1.5:44093"], "worker": ["192.168.1.7:36945", "192.168.1.11:45615"]}, "task": {"type": "worker", "index": 1}, "environment": "cloud"} 2021-12-27 10:51:15,616 INFO (MainThread-12074) Available GPUs: ['0'] 2021-12-27 10:51:15,616 INFO (MainThread-12074) Proposed GPUs: ['0'] 2021-12-27 10:51:15,616 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0 2021-12-27 10:51:15,616 INFO (MainThread-12074) Starting TensorFlow worker:1 as worker on cluster node 2 on background process 21/12/27 10:51:15 INFO PythonRunner: Times: total = 9778, boot = 316, init = 1021, finish = 8441 2021-12-27 10:51:15,627 WARNING (MainThread-12152) From /home/amax/TOS/TensorFlowOnSpark/examples/mnist/keras/mnist_spark.py:11: _CollectiveAllReduceStrategyExperimental.init (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version. Instructions for updating: use distribute.MultiWorkerMirroredStrategy instead 21/12/27 10:51:15 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1549 bytes result sent to driver 21/12/27 10:51:15 INFO CoarseGrainedExecutorBackend: Got assigned task 4 21/12/27 10:51:15 INFO Executor: Running task 1.0 in stage 1.0 (TID 4) 21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB) 21/12/27 10:51:15 INFO TransportClientFactory: Successfully created connection to /192.168.1.5:33393 after 1 ms (0 ms spent in bootstraps) 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB) 21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 2 took 25 ms 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB) 21/12/27 10:51:15 INFO HadoopRDD: Input split: hdfs://amax11:8020/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train/part-00001:0+11232549 21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB) 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB) 21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 0 took 9 ms 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 426.4 KiB, free 14.2 GiB) 2021-12-27 10:51:16.542093: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-12-27 10:51:16.980995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6 2021-12-27 10:51:16.989052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:1/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6 2021-12-27 10:51:16.994323: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 192.168.1.5:44093} 2021-12-27 10:51:16.994344: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.1.7:36945, 1 -> 192.168.1.11:45615} 2021-12-27 10:51:16.995505: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://192.168.1.11:45615 2021-12-27 10:51:17,003 INFO (MainThread-12152) Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:1/device:GPU:0'] 2021-12-27 10:51:17,104 INFO (MainThread-12152) Waiting for the cluster, timeout = inf 2021-12-27 10:51:17,194 INFO (MainThread-12169) Connected to TFSparkNode.mgr on 192.168.1.11, executor=2, state='running' 2021-12-27 10:51:17,201 INFO (MainThread-12169) mgr.state='running' 2021-12-27 10:51:17,201 INFO (MainThread-12169) Feeding partition <itertools.chain object at 0x7f8091c66d30> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f7fe8a757f0> 21/12/27 10:51:17 INFO PythonRunner: Times: total = 772, boot = -382, init = 438, finish = 716

Spark Submit Command Line: ${SPARK_HOME}/bin/spark-submit
--master ${MASTER}
--conf spark.cores.max=${TOTAL_CORES}
--conf spark.task.cpus=${CORES_PER_WORKER}
--executor-memory 27G
${TFoS_HOME}/examples/mnist/keras/mnist_spark.py
--cluster_size ${SPARK_WORKER_INSTANCES}
--images_labels ${TFoS_HOME}/data/mnist/csv/train
--model_dir ${TFoS_HOME}/mnist_model
--export_dir ${TFoS_HOME}/mnist_export
opened by icszhr 1
How to integrate a model into Spark cluster

How can I integrate a model into a Spark cluster in real? I actually have a deep learning (tf, python) based model which I would like to integrate with the Spark cluster to do some experiments. Can anyone give me some suggestions or steps to follow to do that?

opened by jahidhasanlinix 12

Releases(v2.2.5)

v2.2.5(Apr 21, 2022)
Allow use with tensorflow-cpu package.

Dependency updates

Minor fixes.

Source code(tar.gz)
Source code(zip)
v2.2.4(May 25, 2021)
Added option to defer releasing temporary socket/port to user map_function for cases where user code may not bind to the assigned port soon enough to avoid other processes binding to the same port, e.g. extensive pre-processing before invoking TF APIs.

Updated screwdriver.cd build template.

Trigger documentation publish after PyPI push.

Source code(tar.gz)
Source code(zip)
v2.2.3(Mar 23, 2021)
Added ability to disable spark barrier execution in TFParallel

Updated with spark 3 + scala 2.12 dependencies

Fixed documentation build

Source code(tar.gz)
Source code(zip)
v2.2.2(Dec 18, 2020)
Migrated build from travis-ci to screwdriver.cd

Source code(tar.gz)
Source code(zip)
v2.2.1(Mar 16, 2020)
Added support for port ranges in TFOS_SERVER_PORT environment variable.

Updated mnist/keras/mnist_tf.py example with workaround for tensorflow datasets issue.

Added more detailed error message for missing executor_id.

Added unit tests for gpu allocation variants.

Source code(tar.gz)
Source code(zip)
v2.2.0(Feb 19, 2020)
Added support for Spark 3.0 GPU resources

Updated to support Spark 2.4.5

Fixed dataset ordering in mnist_inference.py examples (thanks to @qsbao)

Added optional environment variables to configure TF server/grpc ports and TensorBoard ports on executors

Fixed bug with TFNode.start_cluster_server in backwards-compatibility code for TF1.x

Fixed file conflict issue with compat.export_saved_model in TF2.1

Removed support for Python 2.x

Source code(tar.gz)
Source code(zip)
v2.1.3(Jan 22, 2020)
Detect TF version w/o importing to avoid runtime initialization before GPU allocation.

Source code(tar.gz)
Source code(zip)
v2.1.2(Jan 10, 2020)
Use tf.config.list_physical_devices() to avoid TF runtime initialization.

Source code(tar.gz)
Source code(zip)
v2.1.1(Jan 9, 2020)
added compat.is_gpu_available() method to use:

tf.config.list_logical_devices('GPU') (for TF2.1)

tf.test.is_cuda_available() (for earlier versions of TF).

added ability to launch TensorBoard on chief:0 or master:0 nodes (for small clusters without worker nodes).

Source code(tar.gz)
Source code(zip)
v2.1.0(Dec 9, 2019)
Added compat module to manage minor API changes in TensorFlow.

Added compatibility for TF2.1.0rc0 (exporting saved_models and configuring auto-shard policy)

Re-introduced compatibility for TF1.x (except support for InputMode.TENSORFLOW in the ML Pipeline API).

Added TFParallel class for parallelized single-node inferencing via Spark executors.

Updated examples for TF API changes.

Updated to use module-level loggers.

Source code(tar.gz)
Source code(zip)
v2.0.0(Oct 2, 2019)
initial release compatible with TensorFlow 2.x.

API changes:

removed TFNode.start_cluster_server, which is not required for tf.keras and tf.estimator.

removed TFNode.export_saved_model, which can be replaced by TF native APIs now.

added TFNodeContext.num_workers to count master, chief, and worker nodes.

Spark ML Pipeline API changes:

Scala API has been removed for now, since the Java library for TensorFlow 2.0 is not available yet.

removed InputMode.TENSORFLOW support for ML Pipelines, since the input data is always a Spark DataFrame for this API.

added HasMasterNode and HasGraceSecs params.

removed optional export_fn argument for Spark ML TFEstimator (use TF export APIs instead).

added standard default values for signature_def_key and tag_set for Spark ML TFModel.

modified inferencing code in TFModel for TF2.x APIs.

older TF 1.x examples have been replaced with TF 2.x compatible examples.

Source code(tar.gz)
Source code(zip)
v1.4.4(Sep 30, 2019)
last expected release compatible with TensorFlow 1.x (aside from any critical fixes), since the master branch will be moving to TF 2.0 compatibility.

handle multiple outputs with signaturedef (thanks to @markromedia).

handle exceptions after data feeding.

moved API docs to sphinx_rtd_theme.

updated to Spark 2.4.4.

Source code(tar.gz)
Source code(zip)
v1.4.3(Apr 6, 2019)
removed tensorflow as a dependency, in order to support other variants like tensorflow-gpu or tf-nightly.

allow use of evaluator node type in cluster (thanks to @bbshetty)

refactored cluster template generation.

updated wide-deep example to use models/official code.

restore termination of feed in mnist/spark example.

updated sample notebook instructions.

updated to use Spark 2.3.3.

Source code(tar.gz)
Source code(zip)
v1.4.2(Jan 22, 2019)
Set TF_CONFIG for "chief" clusters (required by DistributionStrategy APIs)

Fix GPU allocation for multi-gpu nodes

Updated examples for MNIST

Updated Hadoop and Spark dependency versions

Source code(tar.gz)
Source code(zip)
v1.4.1(Dec 3, 2018)
Added util.single_node_env(), which can be used to initialize the environment (HDFS compatibility + GPU allocation) for running a single-node instance of TensorFlow on the Spark driver.

Added an example of parallelized inferencing from a pre-trained SavedModel.

Source code(tar.gz)
Source code(zip)
v1.4.0(Nov 16, 2018)
More deterministic GPU allocation for multi-GPU nodes.

Added timeout argument to TFCluster.shutdown() (default is 3 days). This is intended to shutdown the Spark application in the event that any of the TF nodes hang for any reason. Set to -1 to disable timeout.

Added ability to start reservation server on a specific port (contributed by @AvihayTsayeg).

Updated pipeline API for latest TF APIs (contributed by @AvihayTsayeg)

Added unit test for tf.SparseTensor support.

Updated examples to latest TF APIs (including workaround for https://github.com/tensorflow/tensorflow/issues/21745).

Updated Spark version dependency for Scala Inferencing API.

Added __version__ to module.

Source code(tar.gz)
Source code(zip)
v1.3.4(Sep 27, 2018)
Travis CI integration for Python documentation and Scala Inferencing API builds.

Added sys.path to tensorboard search path.

Source code(tar.gz)
Source code(zip)
v1.3.3(Sep 6, 2018)
Only set TF_CONFIG environment variable if cluster_spec has a "master", i.e. when using tf.estimator.

Updated mnist/keras/mnist_mlp_estimator.py with example of distributed/parallel inferencing via estimator.predict.

Added optional feed_timeout argument to TFCluster.train() for InputMode.SPARK.

Added optional grace_secs argument to TFCluster.shutdown().

Workaround for firewall proxy issue with get_ip_address (contributed by @viplav).

Add support for all Hadoop-compatible File System schemes (contributed by @vishnu2kmohan).

Added error messages to assert statements.

Initial Travis CI integration.

Source code(tar.gz)
Source code(zip)
v1.3.2(Jul 13, 2018)
add grace period to TFCluster.shutdown()

add wide & deep example (contributed by @crafet)

update mnist/pipeline examples to tf.data, add instructions, and misc code cleanup (from @yileic)

parameterize versions in pom.xml and code cleanup (from @tmielika)

update Scala Inferencing pom.xml to latest tensorflow-hadoop artifact (contributed by @psuszyns)

Source code(tar.gz)
Source code(zip)
v1.3.1(Jul 13, 2018)
Add keras/estimator example

Update original keras example to latest tf.keras apis

Update Scala Inferencing pom.xml to latest TF java version

Allow PS to use CPU on TF-GPU builds (contributed by @dratini6)

More pep8

More py2/py3 compat

Source code(tar.gz)
Source code(zip)
v1.3.0(Apr 1, 2018)
support tf.estimator.train_and_evaluate() API

use local file instead of ppid to uniquely identify executors

surface GPU allocation errors more readily

add sharding, epochs, and shuffling to mnist Dataset example.

TFoS example for criteo data (contributed by @amantrac)

use tf.train.MonitoredTrainingSession in mnist/spark example (contributed by @wuyifan18)

Source code(tar.gz)
Source code(zip)
v1.2.1(Apr 1, 2018)
Error handling for TF exceptions in InputMode.SPARK (from @eordentlich).

Add timeout for reservations (from @eordentlich).

Errors will exit Spark job with non-zero exit code.

Fix regression in pipeline API.

Model export tool

Source code(tar.gz)
Source code(zip)
v1.2.0(Apr 1, 2018)
Added Scala Inference API

Support driver-side PS node(s) (contributed by @winston-zillow)

Source code(tar.gz)
Source code(zip)

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

Related tags

Overview

TensorFlowOnSpark

Table of Contents

Background

Install

Usage

API

Contribute

License

Comments

Bump protobuf-java from 3.16.1 to 3.16.3

Protobuf Release v3.16.3

Java

Evalator hangs while training

do we support scala & java code write tensorflow model with tenorflow-core-api ?

Get stuck at "Added broadcast_0_piece0 in memory on" while runing Spark standalone cluster

How to integrate a model into Spark cluster

Releases(v2.2.5)

v2.2.5(Apr 21, 2022)

v2.2.4(May 25, 2021)

v2.2.3(Mar 23, 2021)

v2.2.2(Dec 18, 2020)

v2.2.1(Mar 16, 2020)

v2.2.0(Feb 19, 2020)

v2.1.3(Jan 22, 2020)

v2.1.2(Jan 10, 2020)

v2.1.1(Jan 9, 2020)

v2.1.0(Dec 9, 2019)

v2.0.0(Oct 2, 2019)

v1.4.4(Sep 30, 2019)

v1.4.3(Apr 6, 2019)

v1.4.2(Jan 22, 2019)

v1.4.1(Dec 3, 2018)

v1.4.0(Nov 16, 2018)

v1.3.4(Sep 27, 2018)

v1.3.3(Sep 6, 2018)

v1.3.2(Jul 13, 2018)

v1.3.1(Jul 13, 2018)

v1.3.0(Apr 1, 2018)

v1.2.1(Apr 1, 2018)

v1.2.0(Apr 1, 2018)

Owner

Yahoo

Contrastive Loss Gradient Attack (CLGA)

Aerial Single-View Depth Completion with Image-Guided Uncertainty Estimation (RA-L/ICRA 2020)

Official implementation of "Membership Inference Attacks Against Self-supervised Speech Models"

This repository includes code of my study about Asynchronous in Frequency domain of GAN images.

Repository providing a wide range of self-supervised pretrained models for computer vision tasks.

Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences forImage-Text Retrieval

Learning Visual Words for Weakly-Supervised Semantic Segmentation

Internship Assessment Task for BaggageAI.

OverFeat is a Convolutional Network-based image classifier and feature extractor.

Code and model benchmarks for "SEVIR : A Storm Event Imagery Dataset for Deep Learning Applications in Radar and Satellite Meteorology"

The FIRST GANs-based omics-to-omics translation framework

High performance distributed framework for training deep learning recommendation models based on PyTorch.

LUKE -- Language Understanding with Knowledge-based Embeddings

Code for the Paper: Alexandra Lindt and Emiel Hoogeboom.

PyTorch Implementation of [1611.06440] Pruning Convolutional Neural Networks for Resource Efficient Inference

OstrichRL: A Musculoskeletal Ostrich Simulation to Study Bio-mechanical Locomotion.

TensorFlow implementation of "Learning from Simulated and Unsupervised Images through Adversarial Training"

Unofficial Pytorch Lightning implementation of Contrastive Syn-to-Real Generalization (ICLR, 2021)

CR-Fill: Generative Image Inpainting with Auxiliary Contextual Reconstruction. ICCV 2021

Python/Rust implementations and notes from Proofs Arguments and Zero Knowledge