geobeam - adds GIS capabilities to your Apache Beam and Dataflow pipelines.

Overview

geobeam adds GIS capabilities to your Apache Beam pipelines.

What does geobeam do?

geobeam enables you to ingest and analyze massive amounts of geospatial data in parallel using Dataflow. geobeam provides a set of FileBasedSource classes that make it easy to read, process, and write geospatial data, along with Apache Beam transforms and utilities that simplify processing GIS data in your Dataflow pipelines.

See the Full Documentation for the complete API specification.

Requirements

  • Apache Beam 2.27+
  • Python 3.7+

Note: Make sure the Python version used to run the pipeline matches the version in the built container.
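For example, to check the local interpreter before launching a pipeline:

python3 --version    # should match the Python version in the container image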

Supported input types

File format   Data type   Geobeam class
tiff          raster      GeotiffSource
shp           vector      ShapefileSource
gdb           vector      GeodatabaseSource
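As a rough illustration, each format maps to a source class that you instantiate with a GCS URL (the URLs and layer names below are hypothetical):

from geobeam.io import GeotiffSource, ShapefileSource, GeodatabaseSource

raster_source = GeotiffSource('gs://my-bucket/elevation.tif')
vector_source = ShapefileSource('gs://my-bucket/parcels.zip', layer_name='parcels')
gdb_source = GeodatabaseSource('gs://my-bucket/hydro.gdb.zip',
                               gdb_name='hydro.gdb',
                               layer_name='flowlines')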

Included libraries

geobeam includes several Python modules that allow you to perform a wide variety of operations and analyses on your geospatial data.

Module     Version   Description
gdal       3.2.1     Python bindings for GDAL
rasterio   1.1.8     Reads and writes geospatial raster data
fiona      1.8.18    Reads and writes geospatial vector data
shapely    1.7.1     Manipulation and analysis of geometric objects in the cartesian plane

How to Use

1. Install the module

pip install geobeam
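Optionally, install into a fresh virtual environment first (a standard venv workflow, not specific to geobeam):

python3 -m venv env
source env/bin/activate
pip install geobeam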

2. Write your pipeline

Write a normal Apache Beam pipeline using one of geobeam's file sources. See geobeam/examples for inspiration.
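A minimal skeleton, assuming a hypothetical GCS URL, layer name, and BigQuery table (see the Examples section below for complete pipelines):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from geobeam.io import ShapefileSource
from geobeam.fn import make_valid, filter_invalid, format_record

def run(options, gcs_url, layer_name):
  with beam.Pipeline(options=options) as p:
    (p  | beam.io.Read(ShapefileSource(gcs_url, layer_name=layer_name))
        | beam.Map(make_valid)
        | beam.Filter(filter_invalid)
        | beam.Map(format_record)
        | beam.io.WriteToBigQuery('mydataset.mytable'))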

3. Run

Run locally

python -m geobeam.examples.geotiff_dem \
  --gcs_url gs://geobeam/examples/dem-clipped-test.tif \
  --dataset=examples \
  --table=dem \
  --band_column=elev \
  --centroid_only=true \
  --runner=DirectRunner \
  --temp_location <temp gs://> \
  --project <project_id>

You can also run "locally" in Cloud Shell using the -py37 container variants.

Note: Some of the provided examples may take a very long time to run locally...

Run in Dataflow

Write a Dockerfile

This will run in Dataflow as a custom container based on the dataflow-geobeam/base image. See geobeam/examples/Dockerfile for an example that installs the latest geobeam from source.

FROM gcr.io/dataflow-geobeam/base
# FROM gcr.io/dataflow-geobeam/base-py37

RUN pip install geobeam

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# build locally with docker
docker build -t gcr.io/<project_id>/example .
docker push gcr.io/<project_id>/example

# or build with Cloud Build
gcloud builds submit --tag gcr.io/<project_id>/<name> --timeout=3600s --machine-type=n1-highcpu-8

Start the Dataflow job

Note on Python versions

If you are starting a Dataflow job from a machine running Python 3.7, you must use the images suffixed with -py37. (Cloud Shell runs Python 3.7 by default, as of Feb 2021.) A separate version of the base image is built for Python 3.7 and is available at gcr.io/dataflow-geobeam/base-py37. The Python 3.7-compatible examples image is similarly named: gcr.io/dataflow-geobeam/example-py37.

# run the geotiff_soilgrid example in dataflow
python -m geobeam.examples.geotiff_soilgrid \
  --gcs_url gs://geobeam/examples/AWCh3_M_sl1_250m_ll.tif \
  --dataset=examples \
  --table=soilgrid \
  --band_column=h3 \
  --runner=DataflowRunner \
  --worker_harness_container_image=gcr.io/dataflow-geobeam/example \
  --experiment=use_runner_v2 \
  --temp_location=<temp bucket> \
  --service_account_email <service account> \
  --region us-central1 \
  --max_num_workers 2 \
  --machine_type c2-standard-30 \
  --merge_blocks 64

Examples

Polygonize Raster

def run(options, gcs_url):
  import apache_beam as beam
  from geobeam.io import GeotiffSource
  from geobeam.fn import format_record

  with beam.Pipeline(options=options) as p:
    (p  | 'ReadRaster' >> beam.io.Read(GeotiffSource(gcs_url))
        | 'FormatRecord' >> beam.Map(format_record, 'elev', 'float')
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.dem'))
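A hedged usage sketch for invoking run() (the temp bucket is hypothetical; the GeoTIFF is the sample file used in the local-run command above):

if __name__ == '__main__':
  from apache_beam.options.pipeline_options import PipelineOptions
  options = PipelineOptions(['--temp_location', 'gs://my-bucket/tmp'])
  run(options, 'gs://geobeam/examples/dem-clipped-test.tif')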

Validate and Simplify Shapefile

def run(options, gcs_url):
  import apache_beam as beam
  from geobeam.io import ShapefileSource
  from geobeam.fn import make_valid, filter_invalid, format_record

  with beam.Pipeline(options=options) as p:
    (p  | 'ReadShapefile' >> beam.io.Read(ShapefileSource(gcs_url))
        | 'Validate' >> beam.Map(make_valid)
        | 'FilterInvalid' >> beam.Filter(filter_invalid)
        | 'FormatRecord' >> beam.Map(format_record)
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.parcel'))

A number of complete example pipelines are available in the geobeam/examples/ folder. To run them in your Google Cloud project, apply the included Terraform configuration to set up the BigQuery dataset and tables used by the example pipelines.
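For instance, a typical Terraform invocation might look like this (the variable name is hypothetical; check the included configuration for its actual inputs):

terraform init
terraform apply -var="project_id=<project_id>"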

Open up BigQuery GeoViz to visualize your data.

Shapefile Example

The National Flood Hazard Layer, loaded from a shapefile. Example pipeline: geobeam/examples/shapefile_nfhl.py.

Raster Example

The Digital Elevation Model (DEM) contains elevation measurements at 1-meter resolution (values converted to centimeters). Example pipeline: geobeam/examples/geotiff_dem.py.

Included Transforms

The geobeam.fn module includes several Beam Transforms that you can use in your pipelines.

Transform                  Description
geobeam.fn.make_valid      Attempt to make all geometries valid
geobeam.fn.filter_invalid  Filter out invalid geometries that cannot be made valid
geobeam.fn.format_record   Format the (props, geom) tuple received from a FileSource into a dict that can be inserted into the destination table
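A rough sketch of how these compose, assuming elements are (props, geom) tuples whose geometry is a GeoJSON-like mapping (the attribute names are hypothetical):

from geobeam.fn import make_valid, filter_invalid, format_record

element = ({'FLD_ZONE': 'AE'}, {'type': 'Point', 'coordinates': [-122.4, 37.8]})
element = make_valid(element)       # attempt to repair the geometry
if filter_invalid(element):         # keep only elements that are now valid
    row = format_record(element)    # props plus the geometry, ready for WriteToBigQuery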

Execution parameters

Each FileSource accepts several parameters that you can use to configure how your data is loaded and processed. These can be parsed as pipeline arguments and passed into the respective FileSources as seen in the example pipelines.

Parameter        Input type   Description                                                                                              Default   Required?
skip_reproject   All          True to skip reprojection during read                                                                    False     No
in_epsg          All          An EPSG integer to override the input source CRS to reproject from                                       -         No
band_number      Raster       The raster band to read from                                                                             1         No
include_nodata   Raster       True to include nodata values                                                                            False     No
centroid_only    Raster       True to only read pixel centroids                                                                        False     No
merge_blocks     Raster       Number of block windows to combine during read; larger values generate larger, better-connected polygons  -       No
layer_name       Vector       Name of layer to read                                                                                    -         Yes
gdb_name         Vector       Name of geodatabase directory in a gdb zip archive                                                       -         Yes, for GDB files
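For illustration, these parameters can also be passed as keyword arguments when constructing a source (a sketch; the URLs and values are hypothetical):

from geobeam.io import GeotiffSource, GeodatabaseSource

raster = GeotiffSource('gs://my-bucket/dem.tif',
                       band_number=1,
                       centroid_only=True,
                       merge_blocks=64)

vector = GeodatabaseSource('gs://my-bucket/hydro.gdb.zip',
                           gdb_name='hydro.gdb',
                           layer_name='flowlines',
                           in_epsg=4269)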

License

This is not an officially supported Google product, though support will be provided on a best-effort basis.

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Comments
  • Add get_bigquery_schema_dataflow

    Added get_bigquery_schema_dataflow, which reads files from Google Cloud Storage and generates schemas that can be used natively with Google Dataflow.

    opened by mbforr 7
  • Unable to get worker harness container image

    Unable to get the worker harness container images used in the examples.

    ➜ docker pull gcr.io/dataflow-geobeam/base
    Using default tag: latest
    Error response from daemon: manifest for gcr.io/dataflow-geobeam/base:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/dataflow-geobeam/base/manifests/latest"
    
    opened by jskontorp 5
  • Encountering errors while trying to run the geotiff examples

    tl;dr This may be an issue with my environment (I'm running Python 3.9.13), but I've had no success getting any of the examples involving gridded data (e.g., geobeam.examples.geotiff_dem) to run locally. Have these been tested recently?

    I was bumping up against TypeError: only size-1 arrays can be converted to Python scalars [while running 'ElevToCentimeters'], which were fixed by using x.astype(int) where appropriate. Then I hit TypeError: format_record() takes from 1 to 2 positional arguments but 3 were given [while running 'FormatRecords']. (Maybe this line needs to be something like 'FormatRecords' >> beam.Map(format_rasterpolygon_record, 'int', known_args.band_column) instead?) Then I got TypeError: Object of type 'ndarray' is not JSON serializable [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] and stopped to jump in here. If it's just my environment, then I'm making changes needlessly. If it's the code, it seemed better for these fixes to be applied directly in the upstream repo.

    Any advice you have would be much appreciated!!

    bug documentation 
    opened by lzachmann 3
  • ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    I am fairly new to Python and Apache Beam; however, I used shapefile_nfhl.py as an example to create a reader for GeoJSON files. I imported GeoJSONSource (as per the documentation) from geobeam.io, but when I run the application I get the following error: ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    Am I missing something? I did follow the instructions to install geobeam (pip install geobeam).

    I have tried this with Python 3.7, 3.9, and 3.10; versions 3.7 and 3.9 give this error, whereas 3.10 does not work at all due to issues installing rasterio.

    I am also running this on macOS Monterey (12.2.1).

    Here is my code:

    def run(pipeline_args, known_args): 
        import apache_beam as beam
        from apache_beam.io.gcp.internal.clients import bigquery as beam_bigquery
        from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
        from geobeam.io import GeoJSONSource
        from geobeam.fn import format_record, make_valid, filter_invalid
    
        pipeline_options = PipelineOptions([
            '--experiments', 'use_beam_bq_sink',
        ] + pipeline_args)
    
        with beam.Pipeline(options=pipeline_options) as p:
            (p
             | beam.io.Read(GeoJSONSource(known_args.gcs_url,
                 layer_name=known_args.layer_name))
             | 'MakeValid' >> beam.Map(make_valid)
             | 'FilterInvalid' >> beam.Filter(filter_invalid)
             | 'FormatRecords' >> beam.Map(format_record)
             | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                 beam_bigquery.TableReference(
                     datasetId=known_args.dataset,
                     tableId=known_args.table),
                 method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                 create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
    
    
    if __name__ == '__main__':
        import logging
        import argparse
    
        logging.getLogger().setLevel(logging.INFO)
    
        parser = argparse.ArgumentParser()
        parser.add_argument('--gcs_url')
        parser.add_argument('--dataset')
        parser.add_argument('--table')
        parser.add_argument('--layer_name')
        parser.add_argument('--in_epsg', type=int, default=None)
        known_args, pipeline_args = parser.parse_known_args()
    
    run(pipeline_args, known_args)
    opened by migaelhartzenberg 3
  • Issue installing geobeam on GCP CloudShell

    Seeing the below issue while installing geobeam on GCP Cloud Shell.

    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-07s73wh5/orjson/
    

    Version of Python used from venv is 3.7.3

    (env) [email protected]:~ $ python
    Python 3.7.3 (default, Jan 22 2021, 20:04:44)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    

    Detailed error message

    Collecting orjson<4.0; python_version >= "3.6" (from apache-beam[gcp]>=2.27.0->geobeam)
      Using cached https://files.pythonhosted.org/packages/75/cd/eac8908d0b4a4b08067bc68c04e52d7601b0ed86bf2a6a3264c46dd51a84/orjson-3.6.3.tar.gz
      Installing build dependencies ... done
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/usr/lib/python3.7/tokenize.py", line 447, in open
            buffer = _builtin_open(filename, 'rb')
        FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-07s73wh5/orjson/setup.py'
    
    opened by hashkanna 1
  • Adding ESRIServerSource and GeoJSONSource

    Hey @tjwebb wanted to send these over - still need to do some testing but wanted to run them by you first.

    GeoJSONSource - this one should be fairly straightforward as it is a single file and Fiona can read the file natively

    ESRIServerSource - I added a package that can handle the transformation of ESRI JSON to GeoJSON, as well as loop through a layer request since the ESRI REST API generally limits features that can be requested to 1000 or 2000. I can write some of this code natively or we can use the package, but not sure if we want to limit the dependencies. The package in question is here.

    https://github.com/openaddresses/pyesridump

    Also any tips for testing locally would be great!

    opened by mbforr 1
  • Unable to load 5GB tif file to BigQuery

    It works fine for a 1GB tif file, but trying to load 2GB-5GB tif files fails with multiple errors during the write to BigQuery.

    If you would like to reproduce the errors, you can get these datasets from https://files.isric.org/soilgrids/former/2017-03-10/data/: BDRLOG_M_250m_ll.tif, OCDENS_M_sl1_250m_ll.tif, ORCDRC_M_sl1_250m_ll.tif

    "Error processing instruction process_bundle-1256. Original traceback is Traceback (most recent call last): File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute response = task() File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in lambda: self.create_worker().do_instruction(request), request) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 606, in do_instruction return getattr(self, request_type)( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 644, in process_bundle bundle_processor.process_bundle(instruction_id)) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 999, in process_bundle input_op_by_transform_id[element.transform_id].process_encoded( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 228, in process_encoded self.output(decoded_value) File "apache_beam/runners/worker/operations.py", line 357, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 359, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 829, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/worker/operations.py", line 838, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/common.py", line 1247, in apache_beam.runners.common.DoFnRunner.process_with_sized_restriction File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 886, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File 
"apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in 
apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1321, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "/usr/local/lib/python3.8/site-packages/future/utils/init.py", line 446, in raise_with_traceback raise exc.with_traceback(traceback) File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) RuntimeError: BrokenPipeError: [Errno 32] Broken pipe [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)-ptransform-3851']

    opened by aswinramakrishnan 1
  • Not able to docker or gcloud submit

    Hi Travis, thank you for creating the geobeam package for our requirement.

    I am raising an issue here to just keep track.

    While using docker build -

     ---> 56341244044b
    Step 9/23 : RUN wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}
     ---> Running in 72bc532c27b8
    The command '/bin/sh -c wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}' returned a non-zero code: 5
    (global-env) [email protected] dataflow-geobeam % 
    (global-env) [email protected] dataflow-geobeam % 
    (global-env) [email protected] dataflow-geobeam % docker image ls                                                                                                        
    REPOSITORY                                                      TAG        IMAGE ID       CREATED         SIZE
    <none>                                                          <none>     56341244044b   5 minutes ago   2.55GB
    

    While trying to run the gcloud builds submit command -

    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp  -fPIC -DPIC -o .libs/BufferOp.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp -o BufferOp.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferBuilder.lo -MD -MP -MF .deps/BufferBuilder.Tpo -c BufferBuilder.cpp -o BufferBuilder.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp  -fPIC -DPIC -o .libs/BufferParameters.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp -o BufferParameters.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferSubgraph.lo -MD -MP -MF .deps/BufferSubgraph.Tpo -c BufferSubgraph.cpp  -fPIC -DPIC -o .libs/BufferSubgraph.o
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     
    Your build timed out. Use the [--timeout=DURATION] flag to change the timeout threshold.
    ERROR: (gcloud.builds.submit) build ee748b6f-5347-4061-a81b-7f46959086c5 completed with status "TIMEOUT"
    
    documentation customer-reported 
    opened by aswinramakrishnan 1
  • Docstring type of fn.format_record is type but takes in string

    Hi! Cool project that I want to test out, but I noticed an inconsistency between the docstring and the code. Not sure which should be followed.

    https://github.com/GoogleCloudPlatform/dataflow-geobeam/blob/21479252be373b795a5c7d6626021b01a042e5de/geobeam/fn.py#L67-L91

    The docstring should be

            band_type (str, optional): Default to int. The data type of the
                raster band column to store in the database.
    ...
            p | beam.Map(geobeam.fn.format_record,
                band_column='elev', band_type='float'
    

    or the code should be

    def format_record(element, band_column=None, band_type=int):
        import json
    
        props, geom = element
        cast = band_type
    

    Thanks!

    opened by jtmiclat 1
  • Create BQ table from shapefile

    Changes covering creation of the BQ table from the SHP file schema, plus examples of loading a "generic shapefile". This avoids the headache of creating the table beforehand, so any SHP can be loaded.

    opened by Vadoid 0
  • [BEAM-12879] Issue affecting examples

    [BEAM-12879] Issue - https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

    Affecting the examples when reading files from the gs://geobeam bucket. Workaround: download the zip files and put them in your own bucket.

    opened by Vadoid 0
  • Create BQ table from SHP schema

    Changes covering creation of the BQ table from the SHP file schema, plus examples of loading a "generic shapefile". This avoids the headache of creating the table beforehand, so any SHP can be loaded.

    opened by Vadoid 0
  • Geobeam Util get_bigquery_schema_dataflow only works for GDB files

    As far as I understand, the current implementation of get_bigquery_schema_dataflow only works for GDB files and doesn't work for SHP files. Should we make the autoschema work for shapefiles as well via Fiona?

    opened by Vadoid 0
  • Updated the Dockerfile to get a more recent gdal version

    • beam sdk 2.36.0 --> beam sdk 2.40.0
    • addition of metview binary with the fix for debian
    • curl 7.73.0 --> curl 7.83.1 (OpenSSL needed)
    • geos 3.9.0 --> geos 3.10.3
    • sqlite 3330000 --> sqlite 3380500
    • proj 7.2.1 --> proj 9.0.0 (using cmake)
    • openjpeg 2.3.1 --> openjpeg 2.5.0
    • addition of hdf5 1.10.5
    • addition of netcdf-c 4.9.0
    • gdal 3.2.1 --> gdal 3.5.1 (using cmake)
    • ensure numpy always gets installed with gdal
    • addition of gcloud components alpha

    Added a longer timeout to cloudbuild.yaml because the initial build takes at least 1h20min.

    opened by bahmandar 0
  • get_bigquery_schema_dataflow() issue and questions

    Hi, I am trying to use geobeam to ingest a shapefile into BigQuery, creating the table with a schema from the shapefile if the table does not exist. I came across a few issues and questions.

    I attempted this using a modified example shapefile_nfhl.py, and ran it with this command.

    python -m shapefile_nfhl --runner DataflowRunner --project my-project --temp_location gs://mybucket-geobeam/data --region australia-southeast1 --worker_harness_container_image gcr.io/dataflow-geobeam/example --experiment use_runner_v2 --service_account_email [email protected] --gcs_url gs://geobeam/examples/510104_20170217.zip --dataset examples --table output_table
    

    Using get_bigquery_schema_dataflow() from geobeam.util throws an error due to an undefined variable.

    NameError: name 'gcs_url' is not defined
    

    I have opened a PR to fix this. #38

    Once the function is fixed, it seems that it does not accept a shapefile. Passing in the GCS URL to the zipped shapefile is throwing this error.

    Traceback (most recent call last):
      File "fiona/_shim.pyx", line 83, in fiona._shim.gdal_open_vector
      File "fiona/_err.pyx", line 291, in fiona._err.exc_wrap_pointer
    fiona._err.CPLE_OpenFailedError: '/vsigs/geobeam/examples/510104_20170217.zip' not recognized as a supported file format.
    

    Am I using the function the wrong way, or are (zipped) shapefiles not supported? For reference, this is the modified template. Thank you!

    opened by muazamkamal 0
  • centroid_only = false error for a particular GeoTIFF dataset

    When ingesting this cropland dataset https://developers.google.com/earth-engine/datasets/catalog/USDA_NASS_CDL?hl=en#citations: if I set the centroid_only parameter to false, I get the following error: Failed 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 319; errors: 1. Please look into the errors[] collection for more details.' reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs-ptransform-31'

    Full steps to reproduce are in the 'Ingesting EE data to BQ' blog.

    opened by remylouisew 0