Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Overview

Slack

Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.

Amundsen is hosted by the LF AI & Data Foundation. It includes three microservices, one data ingestion library and one common library.

  • amundsenfrontendlibrary: Frontend service which is a Flask application with a React frontend.
  • amundsensearchlibrary: Search service, which leverages Elasticsearch for search capabilities, is used to power frontend metadata searching.
  • amundsenmetadatalibrary: Metadata service, which leverages Neo4j or Apache Atlas as the persistent layer, to provide various metadata.
  • amundsendatabuilder: Data ingestion library for building the metadata graph and search index. Users can load the data either with a Python script that uses the library or with an Airflow DAG that imports it.
  • amundsencommon: Amundsen Common library holds code shared among the microservices in Amundsen.
  • amundsengremlin: Amundsen Gremlin library holds code used for converting model objects into vertices and edges in Gremlin. It's used for loading data into an AWS Neptune backend.
  • amundsenrds: Amundsenrds contains ORM models to support relational databases as the metadata backend store in Amundsen. The schema in the ORM models follows the logic of the databuilder models. Amundsenrds will be used in databuilder and metadatalibrary for metadata storage and retrieval with relational databases.

Homepage

Documentation

Requirements

  • Python = 3.6 or 3.7
  • Node = v10 or v12 (v14 may have compatibility issues)
  • npm >= 6

User Interface

Please note that the mock images are for demonstration purposes only.

  • Landing Page: The landing page for Amundsen, including (1) search bars and (2) popular tables

  • Search Preview: See inline search results as you type

  • Table Detail Page: Visualization of a Hive / Redshift table

  • Column detail: Visualization of columns of a Hive / Redshift table which includes an optional stats display

  • Data Preview Page: Visualization of table data preview which could integrate with Apache Superset or other Data Visualization Tools.

Get Involved in the Community

Want help or want to help? Use the button in our header to join our Slack channel. Contributions are more than welcome! As explained in CONTRIBUTING.md, there are many ways to contribute. It does not all have to be code with new features and bug fixes: documentation such as FAQ entries, bug reports, and blog posts sharing experiences all help move Amundsen forward. If you find a security vulnerability, please follow this guide.

Getting Started

Please visit the Amundsen installation documentation for a quick start to bootstrap a default version of Amundsen with dummy data.

Architecture Overview

Please visit Architecture for an overview of the Amundsen architecture.

Supported Entities

  • Tables (from Databases)
  • People (from HR systems)
  • Dashboards

Supported Integrations

Table Connectors

Amundsen can also connect to any database that provides dbapi or sql_alchemy interface (which most DBs provide).
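
To illustrate what a DB-API (PEP 249) interface looks like, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database and lists table and column names, the kind of metadata a connector would extract. The `orders` table and the `list_columns` helper are hypothetical, and this is not the Amundsen databuilder API.

```python
import sqlite3


def list_columns(conn):
    """Return (table, column, type) tuples from a DB-API connection.

    This reads SQLite's own catalog; other databases expose theirs
    differently (for example through information_schema.columns).
    """
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    tables = [row[0] for row in cur.fetchall()]
    result = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        cur.execute(f"PRAGMA table_info('{table}')")
        for _cid, name, col_type, *_rest in cur.fetchall():
            result.append((table, name, col_type))
    return result


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
columns = list_columns(conn)
print(columns)
```

Only `connect`, `cursor`, `execute` and `fetchall` are used here, which is the part of the interface that any DB-API driver shares.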

Dashboard Connectors

ETL Orchestration

BI Viz Tool

Installation

Please visit the Installation guide to learn how to install Amundsen.

Roadmap

Please visit Roadmap if you are interested in Amundsen's upcoming roadmap items.

Blog Posts and Interviews

Talks

  • Disrupting Data Discovery {slides, recording} (Strata SF, March 2019)
  • Amundsen: A Data Discovery Platform from Lyft {slides} (Data Council SF, April 2019)
  • Disrupting Data Discovery {slides} (Strata London, May 2019)
  • ING Data Analytics Platform (Amundsen is mentioned) {slides, recording} (Kubecon Barcelona, May 2019)
  • Disrupting Data Discovery {slides, recording} (Making Big Data Easy SF, May 2019)
  • Disrupting Data Discovery {slides, recording} (Neo4j Graph Tour Santa Monica, September 2019)
  • Disrupting Data Discovery {slides} (IDEAS SoCal AI & Data Science Conference, Oct 2019)
  • Data Discovery with Amundsen by Gerard Toonstra from Coolblue {slides} and {talk} (BigData Vilnius 2019)
  • Towards Enterprise Grade Data Discovery and Data Lineage with Apache Atlas and Amundsen by Verdan Mahmood and Marek Wiewiorka from ING {slides, talk} (Big Data Technology Warsaw Summit 2020)
  • Airflow @ Lyft (which covers how we integrate Airflow and Amundsen) by Tao Feng {slides and website} (Airflow Summit 2020)
  • Data DAGs with lineage for fun and for profit by Bolke de Bruin {website} (Airflow Summit 2020)

Related Articles

Community meetings

Community meetings are held on the first Thursday of every month at 9 AM Pacific, Noon Eastern, 6 PM Central European Time. Link to join

Upcoming meetings & notes

You can find the exact date of the next meeting, along with its agenda, in this doc a few weeks before the meeting.

Notes from all past meetings are available here.

Who uses Amundsen?

Here is the list of organizations that are using Amundsen today. If your organization uses Amundsen, please file a PR and update this list.

Currently officially using Amundsen:

  1. Asana
  2. Bagelcode
  3. Bang & Olufsen
  4. Brex
  5. Cameo
  6. Cimpress Technology
  7. Coles Group
  8. Convoy
  9. Databricks
  10. Data Sprints
  11. Dcard
  12. Devoted Health
  13. DHI Group
  14. Edmunds
  15. Everfi
  16. Gusto
  17. Hurb
  18. ING
  19. Instacart
  20. iRobot
  21. Lett
  22. LMC
  23. Loft
  24. Lyft
  25. Merlin
  26. PicPay
  27. Plarium Krasnodar
  28. PUBG
  29. Rapido
  30. REA Group
  31. Remitly
  32. Square
  33. Tile
  34. WeTransfer
  35. Workday

License

Apache 2.0 License.

Comments
  • Programmatic and Manual pathways for table and column descriptions

    Overview

    As a data engineer, there are quite a few properties that we can extract programmatically but that Amundsen does not currently support as first-class properties in the UI. For properties that are likely to be widely useful, I can understand creating issues to ingest them so that they appear in the panel on the right. However, some of these properties are very company specific and wouldn't be needed by other companies. Therefore, I think it would be useful to allow users to ingest structured data without needing to make changes to Amundsen infrastructure.

    What do we do now? And why it won't work in long run?

    Currently, we get around this by programmatically updating the table description with prepared markdown. In the long run, however, we also want users to be able to edit table and column descriptions through Amundsen, which will put us in a bind: we would no longer be able to update descriptions programmatically without a lot of added complexity around reconciling and merging changes.

    Proposal

    My proposal is that there are two types of descriptions for tables and columns.

    One would be programmatic and cannot be modified manually. The other would be the current description. By default, the programmatic description would not appear on the page unless it is populated.
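
The split proposed above can be sketched as a small data model. This is only an illustration under assumed names (`TableDescriptions`, `ingest` and the section titles are hypothetical, not Amundsen's actual model): ingestion writes one field, users edit the other, and the programmatic field is hidden until populated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableDescriptions:
    """Hypothetical container; the field names are illustrative only."""
    manual: Optional[str] = None        # edited by users in the UI
    programmatic: Optional[str] = None  # written only by ingestion jobs

    def sections(self):
        """Return the descriptions to render, skipping unpopulated ones."""
        parts = []
        if self.manual:
            parts.append(("Description", self.manual))
        if self.programmatic:
            parts.append(("Programmatic description", self.programmatic))
        return parts


def ingest(desc: TableDescriptions, generated_markdown: str) -> None:
    # Ingestion overwrites only the programmatic field, so user edits
    # never have to be reconciled or merged.
    desc.programmatic = generated_markdown


desc = TableDescriptions(manual="Orders placed on the website.")
print(desc.sections())
ingest(desc, "**Retention:** 90 days")
print(desc.sections())
```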

    Note about column level

    This is true at the column level as well, where we may want to include company-specific attributes, but I can understand why column-level programmatic descriptions may be too much to ask for in terms of cluttering the UI.

    type:feature status:completed 
    opened by samshuster 35
  • feat: Upgrade feast to 0.17

    Summary of Changes

    Upgrade feast extractor with new Feast architecture

    Tests

    • Update old tests and remove deprecated ones

    CheckList

    Make sure you have checked all steps below to ensure a timely review.

    • [x] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
    • [x] PR includes a summary of changes.
    • [x] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
    • [x] In case of new functionality, my PR adds documentation that describes how to use it.
      • All the public functions and the classes in the PR contain docstrings that explain what it does
    keep fresh area:common area:databuilder area:all 
    opened by amommendes 29
  • OIDC authentication - session not persists issue

    I'm trying to implement Google OIDC authentication on an existing Amundsen deployment. All the microservices (search, metadata, frontend) are deployed with helm on AWS EKS and, without authentication, everything works. But when I enable OIDC, as you'll see below, session information does not persist and the frontend starts showing a "Something went wrong..." message.

    Here are the related logs from the frontend. (Sensitive info is replaced with an arbitrary number of *'s, and I put 'test_data' in the session for debugging purposes.)

    2021-08-13T17:23:20+09:00 /usr/local/lib/python3.7/site-packages/flask/json/__init__.py:179: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
    2021-08-13T17:23:20+09:00   rv = _json.dumps(obj, **kwargs)
    2021-08-13T17:23:20+09:00 /usr/local/lib/python3.7/site-packages/flask/json/__init__.py:205: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
    2021-08-13T17:23:20+09:00   return _json.loads(s, **kwargs)
    2021-08-13T17:23:20+09:00 2021-08-13T08:23:20+0000.168 [DEBUG] __init__._before_request:23 (11:MainThread) - Whitelisted Endpoint: status,healthcheck,health,logout
    2021-08-13T17:23:20+09:00 2021-08-13T08:23:20+0000.169 [ERROR] models._fetch_token:88 (11:MainThread) - Calling _fetch_token(name=google)...
    2021-08-13T17:23:20+09:00 2021-08-13T08:23:20+0000.169 [ERROR] models._fetch_token:92 (11:MainThread) - <SecureCookieSession {'test_data': 'test data', 'user': {'__id': '*****@*************.com', 'at_hash': '**************', 'aud': '***********************', 'azp': '*****************', 'display_name': '******************', 'email': '*****************', 'email_verified': True, 'exp': 1628846069, 'family_name': '******', 'given_name': '***********', 'hd': '***************, 'iat': 1628842469, 'iss': 'https://accounts.google.com', 'locale': 'en', 'name': '************', 'nonce': '*************', 'picture': '****************', 'profile_url': '', 'sub': '**********', 'user_id': '*****************'}}>
    2021-08-13T17:23:32+09:00 /usr/local/lib/python3.7/site-packages/flask/json/__init__.py:179: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
    2021-08-13T17:23:32+09:00   rv = _json.dumps(obj, **kwargs)
    2021-08-13T17:23:32+09:00 /usr/local/lib/python3.7/site-packages/flask/json/__init__.py:179: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
    2021-08-13T17:23:32+09:00   rv = _json.dumps(obj, **kwargs)
    2021-08-13T17:23:32+09:00 /usr/local/lib/python3.7/site-packages/flask/json/__init__.py:179: DeprecationWarning: Importing 'itsdangerous.json' is deprecated and will be removed in ItsDangerous 2.1. Use Python's 'json' module instead.
    2021-08-13T17:23:32+09:00   rv = _json.dumps(obj, **kwargs)
    2021-08-13T17:23:32+09:00 2021-08-13T08:23:32+0000.720 [DEBUG] __init__._before_request:23 (10:MainThread) - Whitelisted Endpoint: status,healthcheck,health,logout
    2021-08-13T17:23:32+09:00 2021-08-13T08:23:32+0000.720 [ERROR] models._fetch_token:88 (10:MainThread) - Calling _fetch_token(name=google)...
    2021-08-13T17:23:32+09:00 2021-08-13T08:23:32+0000.720 [ERROR] models._fetch_token:92 (10:MainThread) - <SecureCookieSession {'test_data': 'test data'}>
    2021-08-13T17:23:32+09:00 2021-08-13T08:23:32+0000.721 [ERROR] models._fetch_token:100 (10:MainThread) - 'user'
    2021-08-13T17:23:32+09:00 2021-08-13T08:23:32+0000.721 [ERROR] __init__._before_request:59 (10:MainThread) - User not logged in, redirecting to auth
    2021-08-13T17:23:32+09:00 Traceback (most recent call last):
    2021-08-13T17:23:32+09:00   File "/usr/local/lib/python3.7/site-packages/flaskoidc/models.py", line 94, in _fetch_token
    2021-08-13T17:23:32+09:00     user_id=session["user"]["__id"],
    2021-08-13T17:23:32+09:00   File "/usr/local/lib/python3.7/site-packages/flask/sessions.py", line 83, in __getitem__
    2021-08-13T17:23:32+09:00     return super(SecureCookieSession, self).__getitem__(key)
    2021-08-13T17:23:32+09:00 KeyError: 'user'
    

    My suspicions are:

    1. Google OIDC has different specs, for example, in my setup, OAuth2Token (flaskoidc) saves access token, refresh token and such, instead of name or user_id. So I tweaked the flaskoidc lib to authenticate the current session user using some google apis, but still the error persists.
    2. Metadata also produces suspicious logs. Except the healthcheck signals, it always returns 302. Is it meant to work like this? I'm not sure.
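
The traceback above shows `session["user"]["__id"]` raising `KeyError` once the `user` entry is missing from the session. A minimal sketch of that failure mode, using plain dicts in place of the Flask session (the `current_user_id` helper is hypothetical and not part of flaskoidc), makes the difference between the two logged requests explicit:

```python
def current_user_id(session):
    """Return the user's id, or None when the session has no 'user' entry."""
    user = session.get("user")
    if not user:
        return None
    return user.get("__id")


# Shapes taken from the two SecureCookieSession dumps in the log above.
first_request = {"test_data": "test data", "user": {"__id": "someone@example.com"}}
second_request = {"test_data": "test data"}  # 'user' is no longer present

print(current_user_id(first_request))
print(current_user_id(second_request))
```

This only surfaces the symptom without a traceback; it does not explain why the `user` key is dropped between requests.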

    Expected Behavior

    meant to work out of the box?

    Current Behavior

    frontend shows 'something went wrong' messages after login when hooked with google oidc

    Possible Solution

    Steps to Reproduce

    1. Deploy Amundsen using the helm chart in the GitHub repo, using frontend-oidc: 3.11.1, metadata-oidc: 3.5.0, search: 2.5.1, with the Flask OIDC env variables set for Google OIDC
    opened by woodchuck1206 28
  • Python setup.py egg_info did not run successfully.

    I'm trying to install Amundsen on Docker running on Windows 10. I'm getting an error while running docker-compose with the Atlas configuration.

    Expected Behavior

    Install success

    Current Behavior

    Getting an error when executing: docker-compose -f docker-amundsen-atlas.yml up

    Steps to Reproduce

    1. Clone the repo
    2. Execute: docker-compose -f docker-amundsen-atlas.yml up (get an error in this step)
    • Error:

      error: subprocess-exited-with-error

      × python setup.py egg_info did not run successfully.
      │ exit code: 1
      ╰─> [10 lines of output]
          /bin/sh: 1: npm: not found
          ERROR:root:npm must be available
          /bin/sh: 1: npm: not found
          /app/setup.py:30: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
            logging.warn('Installation of npm dependencies failed')
          WARNING:root:Installation of npm dependencies failed
          /app/setup.py:31: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
            logging.warn(str(e))
          WARNING:root:Command '['npm install']' returned non-zero exit status 127.
          error in amundsen-frontend setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
          [end of output]

      note: This error originates from a subprocess, and is likely not a problem with pip.
      error: metadata-generation-failed

      × Encountered error while generating package metadata.
      ╰─> See above for output.

      note: This is an issue with the package mentioned above, not pip.
      hint: See above for details.
      1 error occurred:
          * Status: The command '/bin/sh -c pip3 install -e .' returned a non-zero code: 1, Code: 1

    Context

    Your Environment

    • Amundsen version used: 6.7.4
    • Data warehouse stores: none yet
    • Deployment (k8s or native): native
    • Link to your fork or repository:
    opened by FelipeArruda 25
  • More support for get_user_details

    Expected Behavior or Use Case

    The docs suggest setting USER_DETAIL_METHOD = get_user_details in the metadata_service config. After defining the function given in the docs:

    def get_user_details(user_id):
        user_info = {
            'email': 'email',
            'user_id': user_id,
            'first_name': 'Firstname',
            'last_name': 'Lastname',
            'full_name': 'Firstname Lastname',
        }
        return user_info
    

    The only info attached to the user is the email and the users are created with names Firstname Lastname. We're able to properly authenticate the user based on their email with Okta, but we are unable to create the user with their proper names.

    Possible Implementation

    In order for more info to be available for the function, the function header needs to change and accept other arguments like a json of the header. Something like:

    def get_user_details(data):
        user_info = {
            'email': data['email'],
            'user_id': data['user_id'],
            'first_name': data['first_name'],
            'last_name': data['last_name'],
            'full_name': data['first_name']+" "+data['last_name'],
        }
        return user_info
    

    where data is a json object with the request header info.
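
As a sketch of how the proposed signature could behave, assuming `data` is a plain dict built from the request and that the name fields may be absent (both are assumptions, not current metadata service behaviour), a tolerant variant might look like this:

```python
def get_user_details(data):
    """Build the user record from request data, tolerating missing names."""
    first_name = data.get("first_name", "")
    last_name = data.get("last_name", "")
    return {
        "email": data["email"],
        "user_id": data["user_id"],
        "first_name": first_name,
        "last_name": last_name,
        # Join only the parts that are present to avoid stray spaces.
        "full_name": " ".join(part for part in (first_name, last_name) if part),
    }


details = get_user_details({
    "email": "ada@example.com",
    "user_id": "ada@example.com",
    "first_name": "Ada",
    "last_name": "Lovelace",
})
print(details["full_name"])
```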

    If this is implemented, then all of the function calls will need to be changed so that the correct info is passed along. We would need to fork the metadata submodule, build our own image and test our changes. After finishing testing, we'd push back upstream.

    Context

    This is needed to properly create users with their correct names.

    I was following #852 but opened a new issue to provide more context and its current state.

    opened by alldoami 25
  • Feature Proposal: Search service ElasticSearch AWS (and potentially other) authentication support

    Currently, the ES Proxy in the Search service only allows a simple user-password pair in a non-SSL setup. We would like to use the AWS Elasticsearch Service, which requires a specific ES client initialisation. Today that means modifying the code and injecting a different Elasticsearch client.

    Expected Behavior or Use Case

    Users can set this up in the config using env variables:

    PROXY_CLIENT=ELASTICSEARCH_AWS
    CREDENTIALS_PROXY_USER=aws_access_key
    CREDENTIALS_PROXY_PASSWORD=aws_secret_key
    

    Service or Ingestion ETL

    amundsen-search service.

    Possible Implementation

    The change would mostly affect the initialisation of the ES Proxy, e.g.:

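    import os

    from elasticsearch import Elasticsearch, RequestsHttpConnection
    from requests_aws4auth import AWS4Auth
    from search_service.proxy.base import BaseProxy  # module path assumed

    # The current ES Proxy would take a ready-made ES client, without any other changes to the class business logic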
    class ElasticsearchProxy(BaseProxy):
        """
        ElasticSearch connection handler
        """
    
        def __init__(self, *,
                     client: Elasticsearch = None,
                     page_size: int = 10
                     ) -> None:
            """
            Wraps an Elasticsearch client for interactions with the cluster.
            The caller passes a fully constructed Elasticsearch client, {client}.
    
            :param client: Elasticsearch client to use, if provided
            :param page_size: Number of search results to return per request
            """
            self.elasticsearch = client
    
            self.page_size = page_size
    ...
    
    # Old implementation would only be for creating simple ES client but most of the logic is still in ElasticsearchProxy; 
    # can be setup as before by env variable PROXY_CLIENT=ELASTICSEARCH
    class SimpleElasticsearchProxy(ElasticsearchProxy):
        """
        ElasticSearch connection handler
        """
    
        def __init__(self, *,
                     host: str = None,
                     user: str = '',
                     password: str = '',
                     page_size: int = 10
                     ) -> None:
            """
            Constructs simple Elasticsearch client from the parameters provided.
    
            :param host: Elasticsearch host we should connect to
            :param user: user name to use for authentication
            :param password: user password to use for authentication
            :param  page_size: Number of search results to return per request
            """
            http_auth = (user, password) if user else None
            client = Elasticsearch(host, http_auth=http_auth)
    
            super().__init__(client=client, page_size=page_size)
    
    # AWS ES Proxy connector can be setup via env variable PROXY_CLIENT=ELASTICSEARCH_AWS
    class AwsElasticsearchProxy(ElasticsearchProxy):
        """
        ElasticSearch connection handler
        """
    
        def __init__(self, *,
                     host: str = None,
                     user: str = '',
                     password: str = '',
                     page_size: int = 10
                     ) -> None:
            """
            Constructs an AWS-signed Elasticsearch client from the parameters provided.
            The region is read from the AWS_REGION environment variable.
    
            :param host: Elasticsearch host we should connect to
            :param user: AWS access key
            :param password: AWS secret key
            :param page_size: Number of search results to return per request
            """
            region = os.environ.get('AWS_REGION')
            awsauth = AWS4Auth(user, password, region, 'es')
    
            client = Elasticsearch(
                hosts=[{'host': host, 'port': 443}],
                http_auth=awsauth,
                use_ssl=True,
                verify_certs=True,
                connection_class=RequestsHttpConnection
            )
    
            super().__init__(client=client, page_size=page_size)
    

    Context

    This would allow different implementations of the ES client, and of other proxy clients, to be selected via configuration, without writing new code or manually building a Docker image.
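
Selecting the proxy class from configuration, as described above, could be sketched with a simple name-to-class registry. The classes below are empty stand-ins for the ones in the proposal, and `PROXY_CLIENTS`, `build_proxy` and the `PROXY_ENDPOINT` variable are assumptions for illustration; only `PROXY_CLIENT`, `CREDENTIALS_PROXY_USER` and `CREDENTIALS_PROXY_PASSWORD` come from the proposal itself.

```python
import os


class ElasticsearchProxy:
    """Stand-in for the base proxy in the proposal above."""

    def __init__(self, *, host, user="", password=""):
        self.host = host
        self.user = user
        self.password = password


class SimpleElasticsearchProxy(ElasticsearchProxy):
    pass


class AwsElasticsearchProxy(ElasticsearchProxy):
    pass


# Maps PROXY_CLIENT values to proxy classes (hypothetical registry).
PROXY_CLIENTS = {
    "ELASTICSEARCH": SimpleElasticsearchProxy,
    "ELASTICSEARCH_AWS": AwsElasticsearchProxy,
}


def build_proxy(env=None):
    """Instantiate the proxy named by PROXY_CLIENT in the given environment."""
    env = os.environ if env is None else env
    name = env.get("PROXY_CLIENT", "ELASTICSEARCH")
    if name not in PROXY_CLIENTS:
        raise ValueError(f"Unknown PROXY_CLIENT: {name!r}")
    return PROXY_CLIENTS[name](
        host=env.get("PROXY_ENDPOINT", "localhost"),  # variable name assumed
        user=env.get("CREDENTIALS_PROXY_USER", ""),
        password=env.get("CREDENTIALS_PROXY_PASSWORD", ""),
    )


proxy = build_proxy({
    "PROXY_CLIENT": "ELASTICSEARCH_AWS",
    "PROXY_ENDPOINT": "search.example.com",
    "CREDENTIALS_PROXY_USER": "aws_access_key",
    "CREDENTIALS_PROXY_PASSWORD": "aws_secret_key",
})
print(type(proxy).__name__)
```

Passing the environment as a dict keeps the selection logic testable without touching the real process environment.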

    status:completed area:search 
    opened by jsnowacki 24
  • open source amundsen neo4j backup scripts

    AC

    • there will be scripts provided that allow amundsen neo4j data to be backed up (on a schedule) to cloud provider blob storage. aws s3 makes the most sense, and if others need other providers (e.g. azure), then they can provide an extension to this functionality
    • once these scripts are established, we should extend them to the k8s setup as well
    keep fresh 
    opened by javamonkey79 24
  • Would like a guide for How-To deploy Amundsen in production

    Please add points on what you expect from such a guide in a comment below. I will then try to consolidate input and draft up an outline in this comment.

    The guide can end up as ~~/docs/deployment.md~~ is /docs/owners_manual.md better?

    Initial outline:

    • [ ] Basic install of services (in different environments)

      • [x] Docker-compose “vanilla”, but with Gunicorn (WIP #109) ~~data in volumes etc.~~
      • [ ] AWS ECS. original PR: https://github.com/lyft/amundsenfrontendlibrary/pull/216 (or EC2): https://github.com/lyft/amundsenfrontendlibrary/issues/186
      • [x] Kubernetes helm chart install ~~(convert from Compose using https://kompose.io?)~~ (upcoming PR see https://github.com/lyft/amundsen/issues/53#issuecomment-538575978 below)
    • [ ] Setting up ingest (with or without Airflow, see https://github.com/lyft/amundsen/issues/53#issuecomment-617370073)

      Figure out which parts of this belong in Architecture.md and which in the Databuilder repo?

      • [ ] Compared to Quickstart ingest (https://github.com/lyft/amundsen/issues/75)
      • [ ] Then mention source by source; Extractor(s), Model, Metadata - Table Metadata: - Users - Table Usage: (How it works and why in https://github.com/lyft/amundsen/issues/381#issuecomment-613387814) - ...
    • [ ] Configuration - custom build of frontend (to not have to maintain a fork we need to get https://github.com/lyft/amundsen/issues/408 transmogrified into proper documentation/tooling)

      • [x] Small tweaks to turn on/off features, adding logo etc. (mostly Done) https://github.com/lyft/amundsenfrontendlibrary/commit/c256115f7d64da121de4ea36ea9c55592c11f9d5 in PR https://github.com/lyft/amundsenfrontendlibrary/pull/255
      • [x] Config of email notification/feedback Done in PR https://github.com/lyft/amundsenfrontendlibrary/pull/291
      • [x] Data preview (integration with SuperSet) - https://github.com/lyft/amundsen/issues/27#issuecomment-517477074 has some draft contextual lead-in and reasoning and a link to an example setup. But ultimately what ticks the box for this is Tao's guide in https://github.com/lyft/amundsen/blob/master/docs/tutorials/data-preview-with-superset.md (or on the https://lyft.github.io/amundsen/ site, search for SuperSet!)
    • [ ] Security

    • [x] Backup - initial WiP in https://github.com/lyft/amundsen/issues/53#issuecomment-516159598 below ... current result in https://github.com/lyft/amundsen/issues/381#issuecomment-614534794 - and restore (on K8s) implemented in https://github.com/lyft/amundsen/pull/394

    • [ ] Monitoring (statsd etc.?)

    • [ ] Handling upgrades

    • [ ] ....

    type:documentation status:needs_votes area:all 
    opened by jornh 23
  • Neo4jCsvPublisher Speed Optimization (Parallelism)

    Hi Team, I’m wondering if there’s a plan to apply multiprocessing to the publishers. We have a large amount of metadata in production, which ends up running 3 million queries on Neo4j. It takes about 90 minutes to finish.

    To investigate the bottleneck, I looked into the code and logged the time elapsed for each step of a single iteration in the _publish_node function. This is the result:

    • Neo4j query: 0.1ms
    • Create statement: 1ms
    • Others: super fast, doesn’t matter

    Surprisingly, the bottleneck is not the DB query but the statement creation. The process is basically:

    1. loop each row in csv
    2. parse the row into a dictionary
    3. loop through each key value pair in the dictionary to get the props
    4. fill the statement Jinja template with the props
    5. execute the query with the statement

    I’m thinking that instead of reading a row and creating a node in the graph DB one at a time, maybe we could use multiprocessing to speed up the process. I believe there will be no dependency issue as long as we publish all the nodes before publishing relations, which is already handled in the current codebase. I’m planning on implementing multiprocessing for this. Are there any potential problems, such as dependencies or graph DB load?

    Expected Behavior or Use Case

    Speed up the performance of the publisher. Currently, a 90 min sync is not acceptable for our use case 😢

    Service or Ingestion ETL

    Ingestion ETL, publisher

    Possible Implementation

    Thanks to @dkunitsk 's idea, I think there are three possible implementations

    1. Multiprocessing on call side
    2. Multiprocessing on Neo4j publisher
    3. Neo4j UNWIND (Batch processing)
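
The third option, batching with UNWIND, can be sketched as follows. One parameterised statement is sent per batch instead of one templated statement per row, which removes the per-row statement creation measured above. The `Table` label, the `KEY` column and the helper names are illustrative assumptions, not the publisher's actual code.

```python
from itertools import islice

# One parameterised statement per batch. Labels cannot be passed as
# parameters in Cypher, so the label is fixed per statement.
UNWIND_STATEMENT = """
UNWIND $rows AS row
MERGE (n:Table {key: row.key})
SET n += row.props
"""


def batches(rows, size):
    """Yield lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_params(csv_rows):
    """Split each CSV row dict into the node key and remaining properties."""
    return [
        {"key": row["KEY"], "props": {k: v for k, v in row.items() if k != "KEY"}}
        for row in csv_rows
    ]


csv_rows = [{"KEY": f"db://schema/t{i}", "name": f"t{i}"} for i in range(5)]
for chunk in batches(csv_rows, size=2):
    params = {"rows": to_params(chunk)}
    # With the Neo4j Python driver this would be roughly:
    #   tx.run(UNWIND_STATEMENT, params)
    print(len(params["rows"]))
```

Batch size trades transaction size against round trips, so it would need tuning alongside the deadlock behaviour discussed in this issue.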

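    import multiprocessing

    from pyhocon import ConfigFactory

    from databuilder.extractor.hive_table_metadata_extractor import HiveTableMetadataExtractor
    from databuilder.job.job import DefaultJob
    from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
    from databuilder.publisher import neo4j_csv_publisher
    from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
    from databuilder.task.task import DefaultTask


    class HiveParallelIndexer: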
        # Shim for adding all node labels to the NEO4J_DEADLOCK_NODE_LABELS config
        # which enables retries for those node labels. This is important for parallel writing
        # since we see intermittent Neo4j deadlock errors relatively often.
        class ContainsAllList(list):
            def __contains__(self, item):
                return True
    
        def __init__(self, publish_tag: str, parallelism: int):
            self.publish_tag = publish_tag
            self.parallelism = parallelism
    
        def __call__(self, worker_index: int):
            # Sharding:
            #   - take the md5 hash of the schema.table_name
            #   - convert the first 3 characters of the hash to decimal (3 chosen arbitrarily)
            #   - mod by total number of processes
            where_clause_suffix = """
                WHERE MOD(CONV(LEFT(MD5(CONCAT(d.NAME, '.', t.TBL_NAME)), 3), 16, 10), {total_parallelism}) = {worker_index}
                AND t.TBL_TYPE IN ('EXTERNAL_TABLE', 'MANAGED_TABLE', 'VIRTUAL_VIEW')
                AND (t.VIEW_EXPANDED_TEXT != '/* Presto View */' OR t.VIEW_EXPANDED_TEXT is NULL)
            """.format(total_parallelism=self.parallelism,
                worker_index=worker_index)
    
            # configs relevant for multiprocessing
            job_config = ConfigFactory.from_dict({
                'extractor.hive_table_metadata.{}'.format(HiveTableMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY):
                    where_clause_suffix,
                # keeping this relatively low, in our experience, reduces neo4j deadlocks
                'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_TRANSACTION_SIZE):
                    100,
                'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_DEADLOCK_NODE_LABELS):
                    HiveParallelIndexer.ContainsAllList(),
            })
            job = DefaultJob(conf=job_config,
                             task=DefaultTask(
                                 extractor=HiveTableMetadataExtractor(),
                                 loader=FsNeo4jCSVLoader()),
                             publisher=Neo4jCsvPublisher())
            job.launch()
    
    
    parallelism = 16
    indexer = HiveParallelIndexer(
        publish_tag='2021-12-03',  # comma was missing here (syntax error)
        parallelism=parallelism)
    
    with multiprocessing.Pool(processes=parallelism) as pool:
        def callback(_):
            # fast fail in case of exception in any process
            print('terminating due to exception')
            pool.terminate()
        res = pool.map_async(indexer, [i for i in range(parallelism)], error_callback=callback)
        res.get()
    


    type:feature status:needs_votes area:databuilder 
    opened by chonyy 22
  • Feature Proposal: Editable Custom Table Attributes

    Users of Amundsen occasionally want to display table level attributes on the table detail page that are specific to their business or data source. Unlike programmatic descriptions, they also want to be able to edit this information directly in the UI as these attributes are typically human-generated.

    Some examples include:

    • retention policy
    • data usage policy

    Expected Behavior or Use Case

    Users can display and edit additional custom attributes in the table detail page.

    Service or Ingestion ETL

    frontend and metadata services

    Possible Implementation

    Define additional custom table attributes via configuration in the frontend. The custom table attributes are then displayed in the table detail page using the EditableText component. Custom attributes are persisted to the graph using the metadata service and a new PATCH table API endpoint.

    Example Screenshots (if appropriate):

    Context

    This would allow users to add human-generated business-specific metadata to Amundsen and maintain it directly in the UI.

    opened by jkulzick 22
  • Receive error

    Receive error "Something went wrong..." when uploading data from PostgreSQL database

    I have uploaded information about my PostgreSQL tables. When I try to view information about the tables, I see the error "Something went wrong..."

    Expected Behavior

    I want to see metadata information about tables.

    Current Behavior

    I can see that the upload was successful because information about the tables is available in the Neo4j store.

    I used the docker-amundsen.yml script for deployment. In the amundsenfrontend container logs I receive this error:

    2021-03-14T13:11:33+0000.463 [ERROR] v0._get_table_metadata:149 (11:MainThread) - Encountered exception: {'columns': {0: {'col_type': ['Field may not be null.']}, 1: {'col_type': ['Field may not be null.']}}}
    Traceback (most recent call last):
      File "/app/amundsen_application/api/metadata/v0.py", line 143, in _get_table_metadata
        results_dict['tableData'] = marshall_table_full(table_data_raw)
      File "/app/amundsen_application/api/utils/metadata_utils.py", line 109, in marshall_table_full
        table: Table = schema.load(table_dict).data
      File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 588, in load
        result, errors = self._do_load(data, many, partial=partial, postprocess=True)
      File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 711, in _do_load
        raise exc
    marshmallow.exceptions.ValidationError: {'columns': {0: {'col_type': ['Field may not be null.']}, 1: {'col_type': ['Field may not be null.']}}}
    2021-03-14T13:11:33+0000.463 [DEBUG] action_log_callback.on_post_execution:70 (11:MainThread) - Calling callbacks: [<function logging_action_log at 0x7f26506bf8c0>]
    2021-03-14T13:11:33+0000.464 [DEBUG] action_log_callback.logging_action_log:85 (11:MainThread) - logging_action_log: ActionLogParams(command='_get_table_metadata', start_epoch_ms=1615727493322, end_epoch_ms=1615727493463, user='[email protected]', host_name='ee9f995f6b9d', pos_args_json='[]', keyword_args_json='{"table_key": "postgres://ioekgftt.public/demo", "index": "0", "source": "search_results"}', output='{"tableData": {}, "msg": "Encountered exception: {\'columns\': {0: {\'col_type\': [\'Field may not be null.\']}, 1: {\'col_type\': [\'Field may not be null.\']}}}", "status_code": 500}', error=None)
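
    The validation error above reports a null `col_type` for columns 0 and 1. A quick way to confirm which columns are affected is to scan the raw table dictionary before it is marshalled. The sketch below is only illustrative: `find_null_col_types` and the sample `table_dict` are hypothetical and not part of Amundsen; only the `columns` and `col_type` keys are taken from the log.

```python
from typing import Any, Dict, List


def find_null_col_types(table_dict: Dict[str, Any]) -> List[int]:
    """Return the positions of columns whose 'col_type' is missing or None."""
    columns = table_dict.get("columns") or []
    return [
        index
        for index, column in enumerate(columns)
        if column.get("col_type") is None
    ]


# Hypothetical payload shaped like the one that failed validation above.
table_dict = {
    "columns": [
        {"name": "id", "col_type": None},
        {"name": "label"},
        {"name": "created_at", "col_type": "timestamp"},
    ]
}

print(find_null_col_types(table_dict))  # [0, 1]
```

    If such a check flags columns, the likely place to look is the extraction step that populated the column types, since the metadata was already written to the store without them.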

    Steps to Reproduce

    1. Deploy amundsen on docker
    2. Use sample_postgres_loader.py to upload data

    Your Environment

    • Amundsen version used: amundsen-frontend:3.1.0, amundsen-search:2.4.1, amundsen-metadata:3.3.0
    • Data warehouse stores: postgresql
    • Deployment (k8s or native): native(docker)
    opened by Arkronus 21
  • feat: support table/column lineage for mysql backend

    Summary of Changes

    The change is related to issue #2072. It is about supporting table/column lineage for mysql backend end to end.

    • databuilder: updated table_lineage model for table/column lineage record iterator and the corresponding sample_data_loader_mysql.py
    • metadata_service: updated mysql_proxy.py to add lineage related endpoint.

    Tests

    Added unit tests in both databuilder and metadata_service for lineage related change.

    Documentation

    N/A

    CheckList

    Make sure you have checked all steps below to ensure a timely review.

    • [X] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
    • [X] PR includes a summary of changes.
    • [X] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
    • [X] In case of new functionality, my PR adds documentation that describes how to use it.
      • All the public functions and the classes in the PR contain docstrings that explain what it does
    area:databuilder area:metadata category:models category:proxy 
    opened by xuan616 0
  • feat: Make default depth configurable for table lineage graph view

    Summary of Changes

    For certain deployments the graph view can get unreadable when it displays nodes at too great a depth. The app hardcodes the table lineage graph depth to 5 currently. This PR makes that value configurable using the config-types pattern that the rest of the front end uses.

    There is some discussion about the feature in a previous PR that I botched: https://github.com/amundsen-io/amundsen/pull/2069

    Tests

    Added a unit test for the function that gets the default graph depth

    Documentation

    CheckList

    Make sure you have checked all steps below to ensure a timely review.

    • [x] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
    • [x] PR includes a summary of changes.
    • [x] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
    • [ ] In case of new functionality, my PR adds documentation that describes how to use it.
      • All the public functions and the classes in the PR contain docstrings that explain what it does
    status:completed area:frontend category:ui 
    opened by jsnb-devoted 10
  • Changed Ports for Gunicorn Command

    fix: the gunicorn start command used the same port in all three places. It now uses the three different ports configured in the ports section.

    Summary of Changes

    Changed the command to use three different ports.

    Tests

    Documentation

    CheckList

    Make sure you have checked all steps below to ensure a timely review.

    • [x] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
    • [x] PR includes a summary of changes.
    • [x] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
    • [x] In case of new functionality, my PR adds documentation that describes how to use it.
      • All the public functions and the classes in the PR contain docstrings that explain what it does
    area:dev-tools 
    opened by cppandi 2
  • Updated MSSQL databuilder connection string

    Summary of Changes

    Updated the MSSQL databuilder connection string, because the existing way of writing the SQL connection string with SQLAlchemy does not work with the latest version of SQL Server. After a few hours of investigation, I found this way to build the connection string for MSSQL. I have mainly tested it with Azure MSSQL using SQL Server authentication, and I was able to index the schema into Neo4j.

    Changes were made in this file: https://github.com/amundsen-io/amundsen/blob/main/databuilder/example/scripts/sample_mssql_metadata.py
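
    The PR text does not show the new connection string itself. As an illustration only, one pattern documented by SQLAlchemy for SQL Server is to pass a URL-encoded ODBC connection string through the `odbc_connect` query parameter. The sketch below assumes the pyodbc driver and the "ODBC Driver 17 for SQL Server" driver name; the `mssql_connection_string` helper, server, database, and credentials are placeholders, and this is not necessarily what the PR changed.

```python
from urllib.parse import quote_plus


def mssql_connection_string(server: str, database: str, user: str, password: str) -> str:
    """Build a SQLAlchemy URL that wraps a raw ODBC connection string."""
    odbc = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID={user};PWD={password}"
    )
    # quote_plus escapes characters such as ';', '=', '{' and spaces.
    return f"mssql+pyodbc:///?odbc_connect={quote_plus(odbc)}"


# Placeholder values for illustration only.
conn = mssql_connection_string(
    "example.database.windows.net", "test_db", "db_user", "placeholder-password"
)
print(conn)
```

    Encoding the whole ODBC string avoids problems with passwords or driver names that contain characters a plain `user:password@host` URL would misparse.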

    Tests

    Documentation

    CheckList

    Make sure you have checked all steps below to ensure a timely review.

    • [ ] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
    • [ ] PR includes a summary of changes.
    • [ ] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
    • [ ] In case of new functionality, my PR adds documentation that describes how to use it.
      • All the public functions and the classes in the PR contain docstrings that explain what it does
    status:in_progress area:databuilder 
    opened by akumarseth 2
  • fix: issue 2066 - updated sample MySQL data loader file.

    fixes: #2066

    Summary of Changes

    Changes made to: https://github.com/amundsen-io/amundsen/blob/main/databuilder/example/scripts/sample_mysql_loader.py

    import pymysql
    pymysql.install_as_MySQLdb()
    
    def connection_string():
        user = 'root'
        password = 'root'
        host = 'localhost'
        port = '3307'
        db = 'test_db'
        return "mysql+pymysql://%s:%s@%s:%s/%s" % (user, password, host, port, db)
    
    job_config = ConfigFactory.from_dict({
            f'extractor.mysql_metadata.{MysqlMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY}': where_clause_suffix,
            f'extractor.mysql_metadata.{MysqlMetadataExtractor.USE_CATALOG_AS_CLUSTER_NAME}': True,
            f'extractor.mysql_metadata.extractor.sqlalchemy.{SQLAlchemyExtractor.CONN_STRING}': connection_string(),
            f'loader.filesystem_csv_neo4j.{FsNeo4jCSVLoader.NODE_DIR_PATH}': node_files_folder,
            f'loader.filesystem_csv_neo4j.{FsNeo4jCSVLoader.RELATION_DIR_PATH}': relationship_files_folder,
            f'publisher.neo4j.{neo4j_csv_publisher.NODE_FILES_DIR}': node_files_folder,
            f'publisher.neo4j.{neo4j_csv_publisher.RELATION_FILES_DIR}': relationship_files_folder,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_END_POINT_KEY}': neo4j_endpoint,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_USER}': neo4j_user,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_PASSWORD}': neo4j_password,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_ENCRYPTED}': False,
            f'publisher.neo4j.{neo4j_csv_publisher.JOB_PUBLISH_TAG}': 'unique_tag',  # should use unique tag here like {ds}
        })
    

    Changes made to: https://github.com/amundsen-io/amundsen/blob/main/databuilder/setup.py

    requirements_path = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                     '../requirements.txt')
    

    CheckList

    Make sure you have checked all steps below to ensure a timely review.

    • [x] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
    • [x] PR includes a summary of changes.
    • [ ] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
    • [ ] In case of new functionality, my PR adds documentation that describes how to use it.
      • All the public functions and the classes in the PR contain docstrings that explain what it does
    area:databuilder 
    opened by MalavikaN1 2
  • Amundsen is unable to import MYSQL data

    Expected Behavior

    I changed the connection string in https://github.com/amundsen-io/amundsen/blob/main/databuilder/example/scripts/sample_mysql_loader.py to load locally hosted MySQL data into Amundsen. Changes made in the above file:

    import pymysql
    pymysql.install_as_MySQLdb()
    
    def connection_string():
        user = 'root'
        password = 'root'
        host = 'localhost'
        port = '3307'
        db = 'test_db'
        return "mysql+pymysql://%s:%s@%s:%s/%s" % (user, password, host, port, db)
    

    Current Behavior

    While running the python file, I get the following error: ERROR:neo4j:Failed to write data to connection IPv4Address(('127.0.0.1', 7687)) (IPv4Address(('127.0.0.1', 7687))).

    I tried loading the sample data by running the https://github.com/amundsen-io/amundsen/blob/main/databuilder/example/scripts/sample_data_loader.py file and it worked.

    Possible Solution

    Adding the following entry to job_config in sample_mysql_loader.py fixed the issue:

    f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_ENCRYPTED}': False

    The code now looks like this:

    job_config = ConfigFactory.from_dict({
            f'extractor.mysql_metadata.{MysqlMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY}': where_clause_suffix,
            f'extractor.mysql_metadata.{MysqlMetadataExtractor.USE_CATALOG_AS_CLUSTER_NAME}': True,
            f'extractor.mysql_metadata.extractor.sqlalchemy.{SQLAlchemyExtractor.CONN_STRING}': connection_string(),
            f'loader.filesystem_csv_neo4j.{FsNeo4jCSVLoader.NODE_DIR_PATH}': node_files_folder,
            f'loader.filesystem_csv_neo4j.{FsNeo4jCSVLoader.RELATION_DIR_PATH}': relationship_files_folder,
            f'publisher.neo4j.{neo4j_csv_publisher.NODE_FILES_DIR}': node_files_folder,
            f'publisher.neo4j.{neo4j_csv_publisher.RELATION_FILES_DIR}': relationship_files_folder,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_END_POINT_KEY}': neo4j_endpoint,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_USER}': neo4j_user,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_PASSWORD}': neo4j_password,
            f'publisher.neo4j.{neo4j_csv_publisher.NEO4J_ENCRYPTED}': False,
            f'publisher.neo4j.{neo4j_csv_publisher.JOB_PUBLISH_TAG}': 'unique_tag',  # should use unique tag here like {ds}
        })
    
    

    Your Environment

    • Amundsen databuilder version used: 7.4.3
    • Deployment (k8s or native): native
    • Link to your fork or repository: (https://github.com/MalavikaN1/amundsen)
    type:bug type:question status:needs_triage area:databuilder 
    opened by MalavikaN1 2
Releases(databuilder-7.4.3)
  • databuilder-7.4.3(Dec 16, 2022)

    What's Changed

    • feat: add configurable message to lineage tabs by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2038
    • feat: tweaks styling of Alerts by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2043
    • fix: add parenthesis to upstream tab title by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2046
    • feat: extends resource notices to support extra information by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2045
    • refactor - move homepage components in preparation implementation of configurable widgets. by @B-T-D in https://github.com/amundsen-io/amundsen/pull/2041
    • Fix: Corrected position of arguments in _par by @loojovi in https://github.com/amundsen-io/amundsen/pull/2037
    • Fixes UI crashing on "search page" if we multiple filters with the same category are added (issue #2053) by @mikaalanwar in https://github.com/amundsen-io/amundsen/pull/2057
    • Chore: Bump databuilder version to 7.4.3 by @sahithi03 in https://github.com/amundsen-io/amundsen/pull/2056

    New Contributors

    • @loojovi made their first contribution in https://github.com/amundsen-io/amundsen/pull/2037

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/common-0.30.0...databuilder-7.4.3

  • common-0.30.0(Nov 29, 2022)

    What's Changed

    • fix: better styling for disabled items by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2014
    • fix: adds loading spinners to table lineage tabs by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2016
    • fix: fixes storybook installation by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2017
    • fix: fixes Collapse text button overlapping lineage tabs by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2019
    • fix: fixes cached lineage list content by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2020
    • fix: fixes scrolling issue after tab changes by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2021
    • chore: updates logging to cover tour and feedback widget by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2024
    • chore: support amundsen-rds == 0.0.7 for mysql_proxy in metadata_service by @xuan616 in https://github.com/amundsen-io/amundsen/pull/2022
    • fix: Use isColumnLineagePageEnabled by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/2012
    • chore:feedback cleanup by @Golodhros in https://github.com/amundsen-io/amundsen/pull/2027
    • Fix: Change default value for 'description' in BigQuery_metadata_extractor results by @sahithi03 in https://github.com/amundsen-io/amundsen/pull/2034
    • feat: Add lineage item counts to lineage response by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2039

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/common-0.29.0...common-0.30.0

  • common-0.29.0(Oct 18, 2022)

    What's Changed

    • feat: added optional in_amundsen bool to lineage items by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2010

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/common-0.28.0...common-0.29.0

  • common-0.28.0(Oct 12, 2022)

    What's Changed

    • fix: support capitalized table names by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2004
    • feat: use different internal link than the table details page on lineage by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2006

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.4.2...common-0.28.0

  • databuilder-7.4.2(Oct 3, 2022)

    What's Changed

    • chore--update amundsen-rds version in databuilder requirements.txt by @B-T-D in https://github.com/amundsen-io/amundsen/pull/2001

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.4.1...databuilder-7.4.2

  • databuilder-7.4.1(Sep 29, 2022)

    What's Changed

    • Fix toggle filter and styling; update tests by @B-T-D in https://github.com/amundsen-io/amundsen/pull/1995
    • docs: Fix typo on BigQuery's instructions by @LieAlbertTriAdrian in https://github.com/amundsen-io/amundsen/pull/1997
    • chore: remove sqlalchemy dependency upper bound by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/2000

    New Contributors

    • @B-T-D made their first contribution in https://github.com/amundsen-io/amundsen/pull/1995
    • @LieAlbertTriAdrian made their first contribution in https://github.com/amundsen-io/amundsen/pull/1997

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.4.0...databuilder-7.4.1

  • databuilder-7.4.0(Sep 19, 2022)

    What's Changed

    • fix: more logging for badges by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1991
    • feat: Add configurable prop types to neo4j csv publisher by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1993

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.3.0...databuilder-7.4.0

  • databuilder-7.3.0(Sep 13, 2022)

    What's Changed

    • fix: no reason to raise a 404 when a user has no bookmarks or reads by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1964
    • fix: Dashboard User relationships raised 404s also by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1965
    • Feat/kafka schema registry integration by @farbodahm in https://github.com/amundsen-io/amundsen/pull/1959
    • Get the first item from the healthcheck response by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1968
    • feat: Extend Lineage list view configuration by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1961
    • chore: Update 'Who uses Amundsen?' by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1973
    • fix: Fix hideNonClickableBadges configuration by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1974
    • fix: Correct sharded table prefix extraction in Bigquery Usage Extractor by @sahithi03 in https://github.com/amundsen-io/amundsen/pull/1980
    • feat-use-retryable-query-executor by @Owen-LCH in https://github.com/amundsen-io/amundsen/pull/1941
    • fix: add f to search filter logging by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1984
    • feat: Allow null values set for empty props in neo4j unwind publisher and multiple rels between nodes by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1983
    • chore: Bump databuilder version to 7.3.0 by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1985

    New Contributors

    • @farbodahm made their first contribution in https://github.com/amundsen-io/amundsen/pull/1959

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.2.1...databuilder-7.3.0

  • databuilder-7.2.1(Aug 17, 2022)

    What's Changed

    • feat: Neo4j driver 4.4.5 on metadata by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1952
    • fix: For new publisher fix error handling for already created constraints/indices by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1963

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.2.0...databuilder-7.2.1

  • databuilder-7.2.0(Aug 16, 2022)

    What's Changed

    • fix: Add postgres compatibility in HiveTableLastUpdatedExtractor by @chonyy in https://github.com/amundsen-io/amundsen/pull/1879
    • fix: Add postgres compatibility in PrestoViewMetadataExtractor by @chonyy in https://github.com/amundsen-io/amundsen/pull/1878
    • fix: Fix handling of big ints in preview data by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1950
    • fix: Fix column description overflow-y value by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1937
    • feat: Add config to hide non-clickable badges by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1943
    • fix: Don't retrieve column lineage when it is not enabled by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1956
    • fix: Fix links in Announcements by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1934
    • feat: Add optional configuration to disable Lineage list view links by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1958
    • perf: New neo4j csv publisher to improve performance using batched params by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1957

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.1.2...databuilder-7.2.0

  • databuilder-7.1.2(Aug 1, 2022)

    What's Changed

    • fix: session db name by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1948

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.1.1...databuilder-7.1.2

  • databuilder-7.1.1(Jul 28, 2022)

    What's Changed

    • fix: driver object pickle error by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1944

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.1.0...databuilder-7.1.1

  • databuilder-7.1.0(Jul 27, 2022)

    What's Changed

    • feat: Exclude stats icon if configured stat types are the only ones present by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1939
    • fix: Show the column in the center of the table when navigating to a column link by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1940
    • feat: Neo4j 4.x support by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1942

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-7.0.0...databuilder-7.1.0

  • databuilder-7.0.0(Jul 21, 2022)

    THIS RELEASE IS BACKWARDS INCOMPATIBLE FOR ANYONE USING NEO4J DB < 3.5

    What's Changed

    • chore: migrate databuilder to neo4j-driver 4.4.5 by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1938

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/metadata-3.11.0...databuilder-7.0.0

  • metadata-3.11.0(Jul 12, 2022)

    What's Changed

    • feat: add get_dashbaord support for neptune by @Owen-LCH in https://github.com/amundsen-io/amundsen/pull/1927
    • fix: Fix nested UI for eventbridge metadata by @kahrabian in https://github.com/amundsen-io/amundsen/pull/1912

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-6.12.0...metadata-3.11.0

  • databuilder-6.12.0(Jul 7, 2022)

    What's Changed

    • fix: Change TabsComponent styling to only be sticky in certain cases by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1925
    • feat: Adding new Trino type parser and other type metadata updates by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1917

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-6.11.1...databuilder-6.12.0

  • databuilder-6.11.1(Jul 6, 2022)

    What's Changed

    • feat: Sticky TabsComponent and Table headers by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1924
    • feat: add get_lineage support for neptune backend by @Owen-LCH in https://github.com/amundsen-io/amundsen/pull/1915
    • fix: Update bounds for databuilder google-auth versions by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1918

    New Contributors

    • @Owen-LCH made their first contribution in https://github.com/amundsen-io/amundsen/pull/1915

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-6.11.0...databuilder-6.11.1

  • databuilder-6.11.0(Jul 5, 2022)

    What's Changed

    • refactor: Remove all old frontend parsing for nested columns by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1919
    • docs: Improves the frontend documentation on announcements by @xfiderek in https://github.com/amundsen-io/amundsen/pull/1921
    • feat: Extract search results per page into a config variable by @ozandogrultan in https://github.com/amundsen-io/amundsen/pull/1922
    • feat: added ngram subfield with no stemming on ES mappings by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1895
    • chore: bumped databuilder to 6.11.0 by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1923

    New Contributors

    • @xfiderek made their first contribution in https://github.com/amundsen-io/amundsen/pull/1921

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-6.10.0...databuilder-6.11.0

  • databuilder-6.10.0(Jun 30, 2022)

  • databuilder-6.9.0(Jun 23, 2022)

    What's Changed

    • feat: added addition fields config to publisher by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1898

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/frontend-4.2.0...databuilder-6.9.0

  • frontend-4.2.0(Jun 22, 2022)

    What's Changed

    • chore: updated company list in readme by @xuan616 in https://github.com/amundsen-io/amundsen/pull/1863
    • docs: Update docs for Windows workaround to solve databuilder extras_require error. by @alanmcruickshank in https://github.com/amundsen-io/amundsen/pull/1861
    • refactor: Refactor various column details and add TypeMetadata to TableColumn model by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1864
    • feat: Adding nested columns to be displayed in the column dropdowns as rows by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1865
    • feat: Allow dangerous html based on config variable by @MrwanBaghdad in https://github.com/amundsen-io/amundsen/pull/1459
    • chore: updated search readme by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1859
    • feat: Nested columns special type rows and expand by default by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1872
    • feat: Use type metadata description get/update apis by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1876
    • feat: Search Service Highlighting by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1856
    • feat: search highlighting UI by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1850
    • fix: typos in search proxy impl by @mgorsk1 in https://github.com/amundsen-io/amundsen/pull/1884
    • fix: Highlighting styling improvement by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1887
    • test: Fix databuilder PR unit test import error by @chonyy in https://github.com/amundsen-io/amundsen/pull/1891
    • feat: Adding expand all/collapse all functionality for nested columns by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1888
    • fix: relax pandas version requirements for databuilder by @henridwyer in https://github.com/amundsen-io/amundsen/pull/1858
    • feat: Add clickable rows to table detail page and new expand/collapse arrow icons by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1897
    • fix: Various fixes to nested columns based on feedback by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1901
    • fix: Updating tour feature to store wildcard path instead of the url path by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1904
    • fix: Fixing markdown for truncated column descriptions in the table by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1905
    • fix: Handle qualified tableau datasources more gracefully by @alanmcruickshank in https://github.com/amundsen-io/amundsen/pull/1869
    • feat: Amazon EventBridge Extractor by @kahrabian in https://github.com/amundsen-io/amundsen/pull/1881
    • fix: Fixing column description markdown to handle more multiline cases by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1906
    • feat: Enable new nested columns by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1907

    New Contributors

    • @alanmcruickshank made their first contribution in https://github.com/amundsen-io/amundsen/pull/1861
    • @chonyy made their first contribution in https://github.com/amundsen-io/amundsen/pull/1891
    • @henridwyer made their first contribution in https://github.com/amundsen-io/amundsen/pull/1858
    • @kahrabian made their first contribution in https://github.com/amundsen-io/amundsen/pull/1881

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/frontend-4.1.2...frontend-4.2.0

  • search-4.0.2(May 16, 2022)

    What's Changed

    • fix: toggle filter should clear when off by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1848
    • refactor: Refactor various column details and add TypeMetadata to TableColumn model by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1847
    • fix: fixes tour not resetting on different pages by @Golodhros in https://github.com/amundsen-io/amundsen/pull/1849
    • fix: better behavior for search filters by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1852
    • fix: fixes state sharing between tours of different pages by @Golodhros in https://github.com/amundsen-io/amundsen/pull/1854
    • fix: avoid extra load url search call when default filters are applied by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1853
    • feat: enable UI config overrides post build by @rajasekharreddyparvatha in https://github.com/amundsen-io/amundsen/pull/1830
    • feat: update search service to use new search mappings by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1832
    • fix: fix misaligned source icon by @youngyjd in https://github.com/amundsen-io/amundsen/pull/1855
    • chore: bump release versions for search and frontend by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1862

    New Contributors

    • @rajasekharreddyparvatha made their first contribution in https://github.com/amundsen-io/amundsen/pull/1830

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-6.8.0...search-4.0.2

  • frontend-4.1.2(May 16, 2022)

    What's Changed

    • fix: toggle filter should clear when off by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1848
    • refactor: Refactor various column details and add TypeMetadata to TableColumn model by @kristenarmes in https://github.com/amundsen-io/amundsen/pull/1847
    • fix: fixes tour not resetting on different pages by @Golodhros in https://github.com/amundsen-io/amundsen/pull/1849
    • fix: better behavior for search filters by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1852
    • fix: fixes state sharing between tours of different pages by @Golodhros in https://github.com/amundsen-io/amundsen/pull/1854
    • fix: avoid extra load url search call when default filters are applied by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1853
    • feat: enable UI config overrides post build by @rajasekharreddyparvatha in https://github.com/amundsen-io/amundsen/pull/1830
    • feat: update search service to use new search mappings by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1832
    • fix: fix misaligned source icon by @youngyjd in https://github.com/amundsen-io/amundsen/pull/1855
    • chore: bump release versions for search and frontend by @allisonsuarez in https://github.com/amundsen-io/amundsen/pull/1862

    New Contributors

    • @rajasekharreddyparvatha made their first contribution in https://github.com/amundsen-io/amundsen/pull/1830

    Full Changelog: https://github.com/amundsen-io/amundsen/compare/databuilder-6.8.0...frontend-4.1.2

  • databuilder-6.8.0(May 2, 2022)

  • common-0.27.1(Apr 27, 2022)

  • common-0.27.0(Apr 26, 2022)

  • databuilder-6.7.5(Apr 25, 2022)

  • common-0.26.2(Mar 29, 2022)

  • common-0.26.1(Mar 29, 2022)

    Adds TypeMetadata as a possible ResourceType to allow implementing metadata service endpoints like put_resource_description which take a ResourceType.
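
    The note above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the enum is a local stand-in rather than the common library's actual definition, the member name `Type_Metadata` is a guess at the spelling, and the in-memory store and both functions are hypothetical, not the metadata service's real endpoint code. It only shows why an endpoint that takes a ResourceType needs a matching member before it can accept type metadata.

```python
from enum import Enum, auto
from typing import Dict, Optional, Tuple


class ResourceType(Enum):
    """Local stand-in for a resource-type enum (not the library's definition)."""
    Table = auto()
    Dashboard = auto()
    User = auto()
    Column = auto()
    Type_Metadata = auto()  # assumed spelling of the newly added member


# Hypothetical in-memory store keyed by (resource type, resource key).
_descriptions: Dict[Tuple[ResourceType, str], str] = {}


def put_resource_description(resource_type: ResourceType, key: str,
                             description: str) -> None:
    """Store a description for a resource identified by its type and key."""
    if not isinstance(resource_type, ResourceType):
        raise TypeError("resource_type must be a ResourceType member")
    _descriptions[(resource_type, key)] = description


def get_resource_description(resource_type: ResourceType,
                             key: str) -> Optional[str]:
    """Return the stored description, or None if nothing was stored."""
    return _descriptions.get((resource_type, key))


# Without a Type_Metadata member, this call could not be expressed at all.
put_resource_description(ResourceType.Type_Metadata,
                         "example_type_metadata_key",
                         "Nested struct field")
```

    Keying the store on the enum member keeps a column and its type metadata separate even if they happen to share a key.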

  • databuilder-6.7.4(Mar 18, 2022)

Owner
Amundsen
Amundsen is a data discovery and metadata engine. It is an incubation project of the LF AI & Data Foundation.
Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Damien Farrell 81 Dec 26, 2022
Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)

Spark-DeltaLake-Demo Reliable, Scalable Machine Learning (2022) This project was completed in an attempt to become better acquainted with the latest b

8 Mar 21, 2022
Statistical Analysis 📈 focused on statistical analysis and exploration used on various data sets for personal and professional projects.

Statistical Analysis 📈 This repository focuses on statistical analysis and the exploration used on various data sets for personal and professional pr

Andy Pham 1 Sep 03, 2022
Import, connect and transform data into Excel

xlwings_query Import, connect and transform data into Excel. Description The concept is to apply data transformations to a main query object. When the

George Karakostas 1 Jan 19, 2022
This is a repo documenting the best practices in PySpark.

Spark-Syntax This is a public repo documenting all of the "best practices" of writing PySpark code from what I have learnt from working with PySpark f

Eric Xiao 447 Dec 25, 2022
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 03, 2023
INFO-H515 - Big Data Scalable Analytics

INFO-H515 - Big Data Scalable Analytics Jacopo De Stefani, Giovanni Buroni, Théo Verhelst and Gianluca Bontempi - Machine Learning Group Exercise clas

Yann-Aël Le Borgne 58 Dec 11, 2022
Display the behaviour of a realtime program with a scope or logic analyser.

1. A monitor for realtime MicroPython code This library provides a means of examining the behaviour of a running system. It was initially designed to

Peter Hinch 17 Dec 05, 2022
Single-machine, multiple-card training; mixed-precision training; DALI data loader.

Template Script Category Description Category script comparison script train.py, loader.py for single-machine-multiple-cards training train_DP.py, tra

2 Jun 27, 2022
A columnar data container that can be compressed.

Unmaintained Package Notice Unfortunately, and due to lack of resources, the Blosc Development Team is unable to maintain this package anymore. During

944 Dec 09, 2022
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. Set in a fictional world, the story tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
Jupyter notebooks for the book "The Elements of Statistical Learning".

This repository contains Jupyter notebooks implementing the algorithms found in the book and summary of the textbook.

Madiyar 369 Dec 30, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
cLoops2: full stack analysis tool for chromatin interactions

cLoops2: full stack analysis tool for chromatin interactions Introduction cLoops2 is an extension of our previous work, cLoops. From loop-calling base

YaqiangCao 25 Dec 14, 2022
A simple way to build declarative and distributed data pipelines with Python

unipipeline simple way to build the declarative and distributed data pipelines. Why you should use it Declarative strict config Scaffolding Fully type

aliaksandr-master 0 Jan 26, 2022
MapReader: A computer vision pipeline for the semantic exploration of maps at scale

MapReader A computer vision pipeline for the semantic exploration of maps at scale MapReader is an end-to-end computer vision (CV) pipeline designed b

Living with Machines 25 Dec 26, 2022
This is a tool for speculation of ancestral alleles, calculation of the SFS and drawing its bar plot.

superSFS This is a tool for speculation of ancestral alleles, calculation of the SFS and drawing its bar plot. It is easy to use and runs fast. What you s

3 Dec 16, 2022
2019 Data Science Bowl

Kaggle-2019-Data-Science-Bowl-Solution - Here I present my solution to the Kaggle 2019 Data Science Bowl and how I improved it to win a silver medal in that competition.

Deepak Nandwani 1 Jan 01, 2022
An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

1 Feb 11, 2022