How to Use Databricks for Data Analysis on TiDB Cloud | TiDB Cloud User Guide
2022-07-28 13:00:00 【InfoQ】

Set up a TiDB Cloud Dev Tier cluster
- Register a TiDB Cloud account and log in.
- Under Create Cluster > Developer Tier, choose 1 year Free Trial.
- Set the cluster name and select a region for the cluster.
- Click Create. The TiDB Cloud cluster is created in about 1 to 3 minutes.
- On the Overview panel, click Connect and create a traffic filter. For example, add the IP address 0.0.0.0/0 to allow access from all IPs.
Import the sample data into TiDB Cloud
- In the cluster information pane, click Import. The Data Import Task page appears.
- Configure the import task as follows:
- Data Source Type: Amazon S3
- Bucket URL: s3://tidbcloud-samples/data-ingestion/
- Data Format: TiDB Dumpling
- Role-ARN: arn:aws:iam::385595570414:role/import-sample-access
- When configuring Target Database, enter the Username and Password of your TiDB cluster.
- Click Import to start importing the sample data. The whole process takes about 3 minutes.
- Return to the Overview panel and click Connect to get the MyCLI URL.
- Use the MyCLI client to check whether the sample data was imported successfully:
$ mycli -u root -h tidb.xxxxxx.aws.tidbcloud.com -P 4000
(none)> SELECT COUNT(*) FROM bikeshare.trips;
+----------+
| COUNT(*) |
+----------+
|   816090 |
+----------+
1 row in set
Time: 0.786s
Connect Databricks to TiDB Cloud
- In the Databricks workspace, create and attach a Spark cluster as follows:

- Configure JDBC in the Databricks notebook. TiDB works with the default JDBC driver shipped with Databricks, so there is no need to configure any driver parameters:
%scala
val url = "jdbc:mysql://tidb.xxxx.prod.aws.tidbcloud.com:4000"
val table = "bikeshare.trips"
val user = "root"
val password = "xxxxxxxxxx"
- url: the JDBC URL used to connect to TiDB Cloud
- table: the table to read, in the format database.table (here bikeshare.trips)
- user: the username used to connect to TiDB Cloud
- password: the password of that user
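As a side note (an alternative sketch added here, not part of the original guide), these settings can also be bundled into a java.util.Properties object and passed to spark.read.jdbc instead of chaining .option() calls as in the steps below:
%scala
// Alternative sketch: bundle the connection settings defined above into
// java.util.Properties and read the table with spark.read.jdbc.
// Equivalent to the .option() style used later in this guide.
import java.util.Properties

val connProps = new Properties()
connProps.put("user", user)
connProps.put("password", password)

val tripsViaProps = spark.read.jdbc(url, table, connProps)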
- Check the connectivity to TiDB Cloud:
%scala
import java.sql.DriverManager
val connection = DriverManager.getConnection(url, user, password)
connection.isClosed()
res2: Boolean = false
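Before moving on to Spark, you can optionally run a quick query over the same raw JDBC connection and then close it. This extra check is an addition to the original steps and uses only standard java.sql calls:
%scala
// Optional extra check (not part of the original guide): run a simple query
// over the JDBC connection opened above, then close it. Spark opens its own
// connections later, so closing this one is safe.
val stmt = connection.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM bikeshare.trips")
while (rs.next()) {
  println(s"trips rows: ${rs.getLong(1)}")
}
rs.close()
stmt.close()
connection.close()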
Analyze the data in Databricks
- Create a Spark DataFrame to load the TiDB data. Here we refer to the variables defined in the previous steps:
%scala
// Load the TiDB table over JDBC using the connection variables defined above.
val remote_table = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .load()
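For larger tables, Spark's standard JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) can split the read across executors. The sketch below is an assumption: it presumes the table has a timestamp column named started_at; replace the column and bounds with values that match your schema:
%scala
// Sketch of a partitioned JDBC read. The partition column "started_at" and the
// bounds are assumptions about the sample schema; any numeric, date, or
// timestamp column in the table works.
val partitionedTrips = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .option("partitionColumn", "started_at")
  .option("lowerBound", "2020-01-01 00:00:00")
  .option("upperBound", "2022-01-01 00:00:00")
  .option("numPartitions", "8")
  .load()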
- Query the data. Databricks provides powerful chart display features, and you can customize the chart type:
%scala
display(remote_table.select("*"))

- Create a DataFrame view or a DataFrame table. As an example, create a temporary view named "trips":
%scala
remote_table.createOrReplaceTempView("trips")
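The temporary view can be queried from %sql cells or from Scala via spark.sql. A quick sanity check (added here as an example, not part of the original steps):
%scala
// Query the temporary view registered above from Scala.
display(spark.sql("SELECT COUNT(*) AS trips_count FROM trips"))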
- Query the data with SQL statements. The following statement counts the number of bikes of each type:
%sql
SELECT rideable_type, COUNT(*) count FROM trips GROUP BY rideable_type ORDER BY count DESC
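The write step below reads a table named type_count, so the aggregation result needs to be registered under that name. One way to do this (a sketch based on the query above; the original sample notebook may register it differently) is to express the same aggregation with the DataFrame API and register it as a temporary view:
%scala
// Same aggregation as the SQL above, expressed with the DataFrame API and
// registered as the temporary view "type_count" used by the next step.
import org.apache.spark.sql.functions.desc

val typeCount = spark.table("trips")
  .groupBy("rideable_type")
  .count()                 // adds a column named "count"
  .orderBy(desc("count"))

typeCount.createOrReplaceTempView("type_count")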
- Write the analysis results back to TiDB Cloud:
%scala
spark.table("type_count")
.withColumnRenamed("type", "count")
.write
.format("jdbc")
.option("url", url)
.option("dbtable", "bikeshare.type_count")
.option("user", user)
.option("password", password)
.option("isolationLevel", "NONE")
.mode(SaveMode.Append)
.save()
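As a final check (an addition to the original steps), the table just written can be read back from TiDB Cloud with the same connection settings:
%scala
// Read back bikeshare.type_count to verify the write succeeded.
val writtenBack = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "bikeshare.type_count")
  .option("user", user)
  .option("password", password)
  .load()

display(writtenBack)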
Import the TiDB Cloud sample notebook into Databricks
- In the Databricks workspace, click Create > Import and paste the TiDB Cloud sample URL to download the notebook into your Databricks workspace.
- Attach this notebook to your Spark cluster.
- Replace the JDBC configuration with your own TiDB Cloud cluster information.
- Follow the steps in the notebook to use TiDB Cloud through Databricks.
Summary
This guide walked through creating a TiDB Cloud Developer Tier cluster, importing the bikeshare sample data, connecting to the cluster from Databricks over JDBC, analyzing the data with Spark and SQL, and writing the results back to TiDB Cloud.