Deploying a Databricks Workspace with Terraform

Since I had previously only translated the documentation and never actually run through it, I deployed a workspace by following the steps in this article. The Git-related steps are skipped.


The steps are also documented here.


Note that the steps in this article were performed on a Mac.
This deploys to AWS using a customer-managed VPC, without PrivateLink.

Preparation

Installing Terraform


Run the following commands in a terminal.

brew tap hashicorp/tap
brew install hashicorp/tap/terraform

When I ran the commands above, I got this error:

==> Installing terraform from hashicorp/tap
Error: Your Command Line Tools are too outdated.
Update them from Software Update in System Preferences.

If that doesn't show you any updates, run:
  sudo rm -rf /Library/Developer/CommandLineTools
  sudo xcode-select --install

Alternatively, manually download them from:
  https://developer.apple.com/download/all/.
You should download the Command Line Tools for Xcode 13.4.

If you hit this error, update the Command Line Tools by running:

sudo rm -rf /Library/Developer/CommandLineTools
sudo xcode-select --install
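
Once the tools are updated and the brew commands succeed, you can confirm Terraform is on your PATH; the exact version string will vary by release:

terraform -version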

Installing and configuring the AWS CLI

I used the GUI installer for this.


Following this setup guide, obtain an AWS access key and enter it when running aws configure.
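
For reference, a typical aws configure session looks like the following; every value shown is a placeholder to replace with your own:

aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: ap-northeast-1
Default output format [None]: json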


Configuring Terraform

Create a working directory and move into it.

mkdir normal_workspace
cd normal_workspace

In the steps that follow, I'll create the files below.

vars.tf

This file defines the variables. Update the AWS region to deploy into (region) and the VPC CIDR (cidr_block) as needed.

variable "databricks_account_username" {}
variable "databricks_account_password" {}
variable "databricks_account_id" {}

variable "tags" {
  default = {}
}

variable "cidr_block" {
  default = "10.4.0.0/16"
}

variable "region" {
  default = "ap-northeast-1"
}

// See https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/string
resource "random_string" "naming" {
  special = false
  upper   = false
  length  = 6
}

locals {
  prefix = "demo-${random_string.naming.result}"
}

init.tf

Initialize Terraform with the required Databricks and AWS providers.

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "1.0.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "3.49.0"
    }
  }
}

provider "aws" {
  region = var.region
}

// Initialize provider in "MWS" mode to provision the new workspace.
// alias = "mws" instructs Databricks to connect to https://accounts.cloud.databricks.com, to create
// a Databricks workspace that uses the E2 version of the Databricks on AWS platform.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs#authentication
provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = var.databricks_account_username
  password = var.databricks_account_password
}

cross-account-role.tf

This creates the required IAM cross-account role and the associated policy in your AWS account.

Note that the time_sleep.wait_for_cross_account_role resource below is there to wait for the newly created IAM role to propagate before it is used.

// Create the required AWS STS assume role policy in your AWS account.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/aws_assume_role_policy
data "databricks_aws_assume_role_policy" "this" {
  external_id = var.databricks_account_id
}

// Create the required IAM role in your AWS account.
// See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role
resource "aws_iam_role" "cross_account_role" {
  name               = "${local.prefix}-crossaccount"
  assume_role_policy = data.databricks_aws_assume_role_policy.this.json
  tags               = var.tags
}

// Create the required AWS cross-account policy in your AWS account.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/aws_crossaccount_policy
data "databricks_aws_crossaccount_policy" "this" {}

// Create the required IAM role inline policy in your AWS account.
// See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy
resource "aws_iam_role_policy" "this" {
  name   = "${local.prefix}-policy"
  role   = aws_iam_role.cross_account_role.id
  policy = data.databricks_aws_crossaccount_policy.this.json
}

resource "time_sleep" "wait_for_cross_account_role" {
  depends_on      = [aws_iam_role_policy.this, aws_iam_role.cross_account_role]
  create_duration = "20s"
}

// Properly configure the cross-account role for the creation of new workspaces within your AWS account.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_credentials
resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.databricks_account_id
  role_arn         = aws_iam_role.cross_account_role.arn
  credentials_name = "${local.prefix}-creds"
  depends_on       = [time_sleep.wait_for_cross_account_role]
}
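
One note: time_sleep comes from the hashicorp/time provider, which init.tf doesn't declare. Since it lives in the hashicorp namespace, terraform init resolves it automatically, but if you'd rather pin it explicitly, an entry along these lines (version constraint up to you) could be added to required_providers:

    time = {
      source = "hashicorp/time"
    }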

vpc.tf

This instructs Terraform to create the VPC that Databricks requires in your AWS account.

// Allow access to the list of AWS Availability Zones within the AWS Region that is configured in vars.tf and init.tf.
// See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/availability_zones
data "aws_availability_zones" "available" {}

// Create the required VPC resources in your AWS account.
// See https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws/latest
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.2.0"

  name = local.prefix
  cidr = var.cidr_block
  azs  = data.aws_availability_zones.available.names
  tags = var.tags

  enable_dns_hostnames = true
  enable_nat_gateway   = true
  single_nat_gateway   = true
  create_igw           = true

  public_subnets  = [cidrsubnet(var.cidr_block, 3, 0)]
  private_subnets = [cidrsubnet(var.cidr_block, 3, 1),
                     cidrsubnet(var.cidr_block, 3, 2)]

  manage_default_security_group = true
  default_security_group_name = "${local.prefix}-sg"

  default_security_group_egress = [{
    cidr_blocks = "0.0.0.0/0"
  }]

  default_security_group_ingress = [{
    description = "Allow all internal TCP and UDP"
    self        = true
  }]
}

// Create the required VPC endpoints within your AWS account.
// See https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws/latest/submodules/vpc-endpoints
module "vpc_endpoints" {
  source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "3.2.0"

  vpc_id             = module.vpc.vpc_id
  security_group_ids = [module.vpc.default_security_group_id]

  endpoints = {
    s3 = {
      service         = "s3"
      service_type    = "Gateway"
      route_table_ids = flatten([
        module.vpc.private_route_table_ids,
        module.vpc.public_route_table_ids])
      tags            = {
        Name = "${local.prefix}-s3-vpc-endpoint"
      }
    },
    sts = {
      service             = "sts"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
      tags                = {
        Name = "${local.prefix}-sts-vpc-endpoint"
      }
    },
    kinesis-streams = {
      service             = "kinesis-streams"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
      tags                = {
        Name = "${local.prefix}-kinesis-vpc-endpoint"
      }
    }
  }

  tags = var.tags
}

// Properly configure the VPC and subnets for Databricks within your AWS account.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_networks
resource "databricks_mws_networks" "this" {
  provider           = databricks.mws
  account_id         = var.databricks_account_id
  network_name       = "${local.prefix}-network"
  security_group_ids = [module.vpc.default_security_group_id]
  subnet_ids         = module.vpc.private_subnets
  vpc_id             = module.vpc.vpc_id
}
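
As a side note, with the default cidr_block of 10.4.0.0/16, the cidrsubnet(var.cidr_block, 3, n) calls above add 3 bits to the prefix and carve out /19 subnets. You can verify this in terraform console:

$ terraform console
> cidrsubnet("10.4.0.0/16", 3, 0)
"10.4.0.0/19"
> cidrsubnet("10.4.0.0/16", 3, 1)
"10.4.32.0/19"
> cidrsubnet("10.4.0.0/16", 3, 2)
"10.4.64.0/19"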

root-bucket.tf

This creates the S3 root bucket that Databricks requires in your AWS account.

// Create the S3 root bucket.
// See https://registry.terraform.io/modules/terraform-aws-modules/s3-bucket/aws/latest
resource "aws_s3_bucket" "root_storage_bucket" {
  bucket = "${local.prefix}-rootbucket"
  acl    = "private"
  versioning {
    enabled = false
  }
  force_destroy = true
  tags = merge(var.tags, {
    Name = "${local.prefix}-rootbucket"
  })
}

// Ignore public access control lists (ACLs) on the S3 root bucket and on any objects that this bucket contains.
// See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_public_access_block
resource "aws_s3_bucket_public_access_block" "root_storage_bucket" {
  bucket             = aws_s3_bucket.root_storage_bucket.id
  ignore_public_acls = true
  depends_on         = [aws_s3_bucket.root_storage_bucket]
}

// Configure a simple access policy for the S3 root bucket within your AWS account, so that Databricks can access data in it.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/aws_bucket_policy
data "databricks_aws_bucket_policy" "this" {
  bucket = aws_s3_bucket.root_storage_bucket.bucket
}

// Attach the access policy to the S3 root bucket within your AWS account.
// See https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_policy
resource "aws_s3_bucket_policy" "root_bucket_policy" {
  bucket     = aws_s3_bucket.root_storage_bucket.id
  policy     = data.databricks_aws_bucket_policy.this.json
  depends_on = [aws_s3_bucket_public_access_block.root_storage_bucket]
}

// Configure the S3 root bucket within your AWS account for new Databricks workspaces.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_storage_configurations
resource "databricks_mws_storage_configurations" "this" {
  provider                   = databricks.mws
  account_id                 = var.databricks_account_id
  bucket_name                = aws_s3_bucket.root_storage_bucket.bucket
  storage_configuration_name = "${local.prefix}-storage"
}
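
Incidentally, this is one reason init.tf pins the AWS provider to 3.49.0: from AWS provider v4 onward, the inline acl and versioning arguments on aws_s3_bucket are deprecated in favor of standalone resources. Nothing needs to change for the pinned version, but a v4-style equivalent would look roughly like this sketch:

// Hypothetical v4+ replacement for the inline versioning block above.
resource "aws_s3_bucket_versioning" "root_storage_bucket" {
  bucket = aws_s3_bucket.root_storage_bucket.id
  versioning_configuration {
    status = "Suspended"
  }
}

// Hypothetical v4+ replacement for the inline acl argument above.
resource "aws_s3_bucket_acl" "root_storage_bucket" {
  bucket = aws_s3_bucket.root_storage_bucket.id
  acl    = "private"
}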

workspace.tf

This instructs Terraform to create a workspace in your Databricks account.

// Set up the Databricks workspace to use the E2 version of the Databricks on AWS platform.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/mws_workspaces
resource "databricks_mws_workspaces" "this" {
  provider        = databricks.mws
  account_id      = var.databricks_account_id
  aws_region      = var.region
  workspace_name  = local.prefix
  deployment_name = local.prefix

  credentials_id           = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
  network_id               = databricks_mws_networks.this.network_id
}

// Capture the Databricks workspace's URL.
output "databricks_host" {
  value = databricks_mws_workspaces.this.workspace_url
}

// Initialize the Databricks provider in "normal" (workspace) mode.
// See https://registry.terraform.io/providers/databricks/databricks/latest/docs#authentication
provider "databricks" {
  // In workspace mode, you don't have to give providers aliases. Doing it here, however,
  // makes it easier to reference, for example when creating a Databricks personal access token
  // later in this file.
  alias = "created_workspace"
  host = databricks_mws_workspaces.this.workspace_url
}

// Create a Databricks personal access token, to provision entities within the workspace.
resource "databricks_token" "pat" {
  provider = databricks.created_workspace
  comment  = "Terraform Provisioning"
  lifetime_seconds = 86400
}

// Export the Databricks personal access token's value, for integration tests to run on.
output "databricks_token" {
  value     = databricks_token.pat.token_value
  sensitive = true
}
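
After the apply completes, the workspace URL and the token can be read back from the Terraform state; -raw (available in Terraform 0.15+) prints the sensitive value without quotes:

terraform output databricks_host
terraform output -raw databricks_token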

tutorial.tfvars

Specify the Databricks account ID and the account owner's username and password referenced by the files above. Hardcoding these in the .tf files is not recommended, so they're split out into a separate file. If you're using Git, add *.tfvars to .gitignore so that files with this extension are excluded (see the snippet after the example below).

databricks_account_username = "<your-Databricks-account-username>"
databricks_account_password = "<your-Databricks-account-password>"
databricks_account_id = "<your-Databricks-account-ID>"
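
A minimal .gitignore for this would be something like the following; the local provider cache and state files are also commonly excluded:

*.tfvars
.terraform/
*.tfstate
*.tfstate.backup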

Creating the Databricks and AWS resources with Terraform

Running the following commands creates the resources defined above and deploys the workspace.

terraform init
terraform apply -var-file="tutorial.tfvars"
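
If you'd like to review what will be created before actually applying, you can optionally run plan first with the same variable file:

terraform plan -var-file="tutorial.tfvars"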
(screenshots of the terraform apply run)

A cluster had been started as well. How convenient.

Cleanup

Destroy all the resources with the following command. You'll be asked for the Databricks account ID and the account owner's username and password here as well.

terraform destroy
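
Since tutorial.tfvars isn't loaded automatically, Terraform will prompt for those variables; to skip the prompts, you can pass the file explicitly:

terraform destroy -var-file="tutorial.tfvars"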

I often forget to clean up resources, so this is really helpful.

Next, I'll try out the other deployment patterns.
