4CAT: Capture and Analysis Toolkit

Overview

DOI: 10.5281/zenodo.4742622 · License: MPL 2.0 · Requires Python 3.8

[Screenshots: 4CAT's 'Create Dataset' interface, and a network visualisation of a dataset]

4CAT is a research tool that can be used to analyse and process data from online social platforms. Its goal is to make the capture and analysis of data from these platforms accessible to people through a web interface, without requiring any programming or web scraping skills. Our target audience is researchers, students and journalists interested in using Digital Methods in their work.

In 4CAT, you create a dataset from a given platform according to a given set of parameters; the result of this (usually a CSV file containing matching items) can then be downloaded or analysed further with a suite of analytical 'processors', which range from simple frequency charts to more advanced analyses such as the generation and visualisation of word embedding models.

4CAT has a (growing) number of built-in data sources for popular platforms, and you can also add your own data sources using 4CAT's Python API. The following data sources are currently actively supported:

  • 4chan
  • 8kun
  • Bitchute
  • Parler
  • Reddit
  • Telegram
  • Twitter API (Academic and regular tracks)

Data from several other platforms can be imported into 4CAT for analysis via external tools. A number of further platforms have built-in support that is untested, or requires e.g. special API access. You can view the full list of data sources in the GitHub repository.

Install

You can install 4CAT locally or on a server via Docker or manually. The usual

docker-compose up

will work, but detailed and alternative installation instructions are available in our wiki. Currently 4chan, 8chan, and 8kun require additional steps; please see the wiki.

Please check our issues and create one if you experience any problems (pull requests are also very welcome).

Components

4CAT consists of several components, each in a separate folder:

  • backend: A standalone daemon that collects and processes data, as queued via the tool's web interface or API.
  • webtool: A Flask app that provides a web front-end for searching and analysing the stored data.
  • common: Assets and libraries shared by the other components.
  • datasources: Data source definitions. Each data source is a set of configuration options, database definitions and Python scripts that determine how data from a given platform is collected and processed. If you want to add your own data sources, refer to the wiki.
  • processors: A collection of data processing scripts that plug into 4CAT and manipulate or process datasets created with it. There is an API you can use to make your own processors; a minimal sketch follows below.
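
To illustrate, here is a minimal processor sketch following the shape documented in the wiki. The import path, class attributes and helper method names (BasicProcessor, iterate_items, write_csv_items_and_finish) are assumptions that may differ between 4CAT versions:

    # A hedged sketch of a 4CAT processor, not taken verbatim from the codebase;
    # names and signatures are assumptions based on the wiki's processor API.
    from backend.abstract.processor import BasicProcessor  # assumed import path


    class ExampleCounter(BasicProcessor):
        type = "example-counter"  # unique identifier for this processor
        category = "Metrics"      # grouping in the web interface
        title = "Example counter"
        description = "Counts the items in the source dataset."
        extension = "csv"         # file type of the processor's result

        def process(self):
            # iterate over the parent dataset's items and count them
            count = sum(1 for _ in self.source_dataset.iterate_items(self))
            # write a one-row CSV and mark this dataset as finished
            self.write_csv_items_and_finish([{"items": count}])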

Credits & License

4CAT was created at OILab and the Digital Methods Initiative at the University of Amsterdam. The tool was inspired by TCAT, a tool with comparable functionality that can be used to scrape and analyse Twitter data.

4CAT development is supported by the Dutch PDI-SSH foundation through the CAT4SMR project.

4CAT is licensed under the Mozilla Public License, 2.0. Refer to the LICENSE file for more information.

Comments
  • Allow autologin to _always_ work (or perhaps disable login?)

    I am running a 4cat server in docker, with an apache2 reverse proxy in front. It works fine except for one small thing.

    MYSERVER.domain hosts my apache proxy.

    In settings -> Flask settings I have: Auto-login name = MYSERVER.domain

    However, when I access 4CAT through the proxy I don't want to be met by a login; I just want to be inside. I was thinking that 'Auto-login name' would whitelist hosts so they could bypass the login?
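
    For what it's worth, a purely hypothetical sketch of how such a hostname whitelist check could work; judging by the settings dump in the next comment, the whitelist setting is flask.autologin.hostnames, while 'Auto-login name' appears to be only a display name:

        # Hypothetical illustration, not 4CAT's actual code. The whitelist
        # appears to live in the flask.autologin.hostnames setting.
        autologin_hostnames = ["localhost", "MYSERVER.domain"]

        def may_autologin(request_hostname):
            # "*" whitelists every host; otherwise the requesting hostname must match
            return "*" in autologin_hostnames or request_hostname in autologin_hostnames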

    enhancement 
    opened by anderscollstrup 21
  • Docker swarm server: Cannot make flask frontend work and login (not using default docker-compose); flask overwriting settings values in database

    Hi, I have 4CAT running on a docker swarm server. After modifying the compose file a little to be compatible with docker swarm, and the environment variables a little as well, I got it running, but I cannot log in. I see this is a security feature of flask. I have read https://github.com/digitalmethodsinitiative/4cat/issues/269 and it is also related to issue https://github.com/digitalmethodsinitiative/4cat/issues/272. I cannot find the whitelist or where it is, since there is no config.py anymore.

    Here is a dump of my postgresql settings table; maybe it is relevant:

     DATASOURCES               | {"bitchute": {}, "custom": {}, "douban": {}, "customimport": {}, "parler": {}, "reddit": {"boards": "*"}, "telegram": {}, "twitterv2": {"id_lookup": false}}
     4cat.name                 | "4CAT"
     4cat.name_long            | "4CAT: Capture and Analysis Toolkit"
     4cat.github_url           | "https://github.com/digitalmethodsinitiative/4cat"
     path.versionfile          | ".git-checked-out"
     expire.timeout            | 0
     expire.allow_optout       | true
     logging.slack.level       | "WARNING"
     logging.slack.webhook     | null
     mail.admin_email          | null
     mail.host                 | null
     mail.ssl                  | false
     mail.username             | null
     mail.password             | null
     mail.noreply              | "[email protected]"
     SCRAPE_TIMEOUT            | 5
     SCRAPE_PROXIES            | {"http": []}
     IMAGE_INTERVAL            | 3600
     explorer.max_posts        | 100000
     flask.flask_app           | "webtool/fourcat"
     flask.secret_key          | "2e3037b7533c100f324e472a"
     flask.https               | false
     flask.autologin.name      | "Automatic login"
     flask.autologin.api       | ["localhost", "4cat.coraldigital.mx", "\"4cat.coraldigital.mx\"", "51.81.52.207", "0.0.0.0"]
     flask.server_name         | ""
     flask.autologin.hostnames | ["*"]
    
    docker issue 
    opened by hydrosIII 17
  • Cannot make flask frontend work

    Backend is running:

        [email protected]:/usr/local/4cat# ps -ef | grep python
        root       497     1  0 10:36 ?      00:00:02 /usr/bin/python3 /usr/bin/fail2ban-server -xf start
        root       516     1  0 10:36 ?      00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
        4cat     18989     1 59 12:39 ?      00:00:01 /usr/bin/python3 4cat-daemon.py start
        root     19008   891  0 12:39 pts/0  00:00:00 grep python

        [email protected]:/usr/local/4cat# pip install python-dotenv
        Collecting python-dotenv
          Downloading python_dotenv-0.20.0-py3-none-any.whl (17 kB)
        Installing collected packages: python-dotenv
        Successfully installed python-dotenv-0.20.0
        [email protected]:/usr/local/4cat# FLASK_APP=webtool flask run --host=0.0.0.0

         * Serving Flask app "webtool"
         * Environment: production
           WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
         * Debug mode: off
         * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
        /usr/local/lib/python3.9/dist-packages/flask/sessions.py:208: UserWarning: "localhost" is not a valid cookie domain, it must contain a ".". Add an entry to your hosts file, for example "localhost.localdomain", and use that instead.
          warnings.warn(
        MY PC IP - - [10/Jun/2022 12:36:54] "GET / HTTP/1.1" 404 -

    And I get 404 in my browser when I point to http://server_ip:5000

    4CAT was installed manually, using this guide: https://github.com/digitalmethodsinitiative/4cat/wiki/Installing-4CAT
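
    For what it's worth, the 404 is consistent with Flask's documented routing behaviour: when SERVER_NAME is set (and the UserWarning above suggests it is "localhost" here), requests whose Host header does not match it are not routed. A minimal illustration, not 4CAT-specific:

        # Minimal illustration of Flask's SERVER_NAME host matching; not 4CAT code.
        from flask import Flask

        app = Flask(__name__)
        # requests whose Host header does not match SERVER_NAME get a 404,
        # e.g. a browser pointed at http://server_ip:5000/
        app.config["SERVER_NAME"] = "localhost:5000"

        @app.route("/")
        def index():
            return "Hello"

        if __name__ == "__main__":
            app.run(host="0.0.0.0")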

    docker issue 
    opened by anderscollstrup 17
  • Issue with migrate.py preventing me from running 4cat or accessing web interface

    Hello, thanks for making this tool available. I'd be grateful for any tips: I'm getting an 'EOFError: EOF when reading a line' message when I run docker-compose up. I'm using Windows 10 Home. I initially tried to install 4cat manually to scrape 4chan, but I couldn't get it to work so I uninstalled and then tried to install through Docker.

    I'm using Windows Powershell to run the command because when I run docker-compose up in Ubuntu 20.04 LTS I'm getting this message:

    'The command 'docker-compose' could not be found in this WSL 2 distro. We recommend to activate the WSL integration in Docker Desktop settings.

    See https://docs.docker.com/desktop/windows/wsl/ for details.'

    The WSL integration is activated in Docker Desktop settings by default. Could it be because I didn't bind-mount the folder I'm storing 4cat in to the Linux file system? I skipped that step and just stored 4cat in /c/users/myusername/ on Windows.

    This is the message I get when I run the docker-compose up command from Powershell:

        PS C:\users\myusername\4cat> docker-compose up
        [+] Running 2/2
         • Container cat_db_1  Running    0.0s
         • Container api       Recreated  0.7s
        Attaching to api, db_1
        api  | Waiting for postgres...
        api  | PostgreSQL started
        api  | 1
        api  | Seed present
        api  | Starting app
        api  | Running migrations
        api  |
        api  | 4CAT migration agent
        api  | ------------------------------------------
        api  | Current 4CAT version:  1.9
        api  | Checked out version:   1.16
        api  | The following migration scripts will be run:
        api  |   migrate-1.9-1.10.py
        api  |   migrate-1.10-1.11.py
        api  |   migrate-1.11-1.12.py
        api  |   migrate-1.12-1.13.py
        api  |   migrate-1.13-1.14.py
        api  |   migrate-1.14-1.15.py
        api  | WARNING: Migration can take quite a while. 4CAT will not be available during migration.
        api  | If 4CAT is still running, it will be shut down now.
        api  | Do you want to continue [y/n]? Traceback (most recent call last):
        api  |   File "helper-scripts/migrate.py", line 142, in <module>
        api  |     if not args.yes and input("").lower() != "y":
        api  | EOFError: EOF when reading a line
        api exited with code 1
    opened by robbydigital 15
  • Unknown local index '4chan_posts' in search request

    We managed to overcome our previous issue thanks to your advice. However, we are now stuck with an error related to the indexes, appearing whenever we query 4chan.

    First we have generated the sphinx.conf using helper_script/generate_sphinx_config.py. This results in the following indexes:

        [...]

        /* Indexes */

        index 4cat_index {
            min_infix_len = 3
            html_strip = 1
            type = template
            charset_table = 0..9, a..z, _, A..Z->a..z, U+47, U+58, U+40, U+41, U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c,$
        }

        index 4chan_posts : 4cat_index {
            type = plain
            source = 4chan_posts_old
            path = /opt/sphinx/data/4chan_posts
        }

        index 4chan_posts : 4cat_index {
            type = plain
            source = 4chan_posts_new
            path = /opt/sphinx/data/4chan_posts
        }

        [...]

    However, starting sphinx with this setup results in the following error:

        Mar 16 11:48:44 dev sphinxsearch[505]: ERROR: section '4chan_posts' (type='index') already exists in /etc/sphinxsearch/sphinx.conf line 51 col 19.

    I have then attempted to comment out one of the indexes and/or change the path, which allows sphinx to start. However, another error then appears once collection has been initiated:

    16-03-2020 11:50:54 | ERROR (threading.py:884): Sphinx crash during query deb9cfe3e0a47d56612fd6e453208ed6: (1064, "unknown local index '4chan_posts' in search request\x00")
    

    Hope you once again can help me figure out how the indexes should be set.

    opened by bornakke 12
  • Installing problem: frontend failed to run with 'docker-compose up' command

    When running the command docker-compose up, the database and backend components come up fine, but the frontend never finishes starting and always gets stuck at "[INFO] Booting worker with pid: 12". The problem is still there after restarting the frontend component in the Docker UI.

    docker issue 
    opened by baiyuan523 11
  • Error "string indices must be integers" from search_twitter.py:403

    From our 4cat.log

    21-09-2021 10:48:11 | INFO (processor.py:890): Running processor count-posts on dataset a5eeaf86aa27ff91f212d35880090d70
    21-09-2021 10:48:11 | INFO (processor.py:890): Running processor attribute-frequencies on dataset 659e224c54209146f7551523e8d26633
    21-09-2021 10:48:11 | ERROR (worker.py:890): Processor count-posts raised TypeError while processing dataset a5eeaf86aa27ff91f212d35880090d70 (via 76e33804acca3ac18d3cfa8de8059780) in count_posts.py:59->processor.py:316->search_twitter.py:403:
       string indices must be integers
    
    21-09-2021 10:48:11 | ERROR (worker.py:890): Processor attribute-frequencies raised TypeError while processing dataset 659e224c54209146f7551523e8d26633 (via 01db05ce10f58b320a397d68b61986a2) in rank_attribute.py:132->processor.py:316->search_twitter.py:403:
       string indices must be integers
    

    The line in question is from SearchWithTwitterAPIv2.map_item() https://github.com/digitalmethodsinitiative/4cat/blob/f0e01fb500b7dafb58a05873cf34bf15e288a88c/datasources/twitterv2/search_twitter.py#L403

    and I haven't found a good way to run 4CAT under a debugger and/or have it tell me the ID of the offending tweet.

    Could this be related to #169 ?
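
    In case it helps triage, a hedged debugging sketch: wrapping the mapping call logs the offending item before the TypeError propagates. The exact call signature of map_item() is an assumption:

        # Hedged debugging sketch, assumed to run inside 4CAT's codebase:
        # from datasources.twitterv2.search_twitter import SearchWithTwitterAPIv2
        def debug_map_item(item):
            try:
                return SearchWithTwitterAPIv2.map_item(item)
            except TypeError:
                # if the API returned e.g. a plain string instead of a tweet
                # object, this shows what the item actually was
                print("Unmappable item: %r" % (item,))
                raise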

    opened by xmacex 10
  • AttributeError: 'Namespace' object has no attribute 'release'

    Fresh installation on a Mac with Docker from local files. Any idea what I did wrong?

    4cat_backend:

        Waiting for postgres...
        PostgreSQL started
        Database already created

        Traceback (most recent call last):
          File "helper-scripts/migrate.py", line 66, in <module>
            if args.release:
        AttributeError: 'Namespace' object has no attribute 'release'

    4cat_backend EXITED (1)

    bug deployment 
    opened by psegovias 9
  • Docker setup fails to "import config" on macOS Big Sur (M1)

    Discussed in https://github.com/digitalmethodsinitiative/4cat/discussions/191

    Originally posted by p-charis, October 25, 2021:

    Hey everyone! First, thanks a million to the developers for building this & making it available :)

    Now, I managed to get 4CAT working on macOS (latest version, M1 native), but only after I removed the following lines from the docker-setup.py file (line #36 onwards). With these lines in place the installation wouldn't work, as it returned the error that no module named config was found. I suspect it might have something to do with the way Docker runs on macOS generally and the paths it creates, but I haven't figured it out yet. So I just wanted to let the devs know, as well as other macOS users who have had a similar problem, that they could try this workaround.

    # Ensure filepaths exist
    import config
    for path in [config.PATH_DATA,
                 config.PATH_IMAGES,
                 config.PATH_LOGS,
                 config.PATH_LOCKFILE,
                 config.PATH_SESSIONS,
                 ]:
        if Path(config.PATH_ROOT, path).is_dir():
            pass
        else:
        os.makedirs(Path(config.PATH_ROOT, path))
    
    bug docker issue 
    opened by p-charis 8
  • Tokeniser exclusion list ignores last word in list

    I'm filtering some commonly used words out of a corpus with the Tokenise processor and it only seems to be partially successful. For example, in one month there are 37,325 instances of one word. When I add the word to the reject list there are still 6307 instances of the word. So it's getting most, but not all. I'm having the same issue with some common swear words that I'm trying to filter out: most are gone, but some remain. Is there a reason for this?
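
    For comparison, a standalone sketch of reject-list filtering; a common cause of entries silently not matching is surrounding whitespace or case differences, so this normalises both (the file name and token list are placeholders, not 4CAT's actual implementation):

        # Standalone sketch; "rejectlist.txt" and the token list are placeholders.
        # Stripping and lower-casing each entry avoids near-misses caused by
        # trailing whitespace or case differences.
        with open("rejectlist.txt", encoding="utf-8") as listfile:
            reject = {word.strip().lower() for word in listfile if word.strip()}

        tokens = ["some", "tokenised", "words"]
        filtered = [token for token in tokens if token.lower() not in reject]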

    Thanks for any insight!

    opened by robbydigital 6
  • Datasource that interfaces with a TCAT instance

    It works, and arguably fixes #117, but:

    • The form looks hideous with the million query fields. Do we need them all for 4CAT? Is there a way to make it look better?
    • The list of bins displayed in the 'create dataset' form simply lists bins from all instances. This can get really long really fast when supporting multiple instances. A custom form control may be necessary to make this user-friendly.
    • The list of bins is loaded synchronously whenever get_options() is run. The result should probably be cached or updated in the background (with a separate worker...?)
    • The data format now follows that of twitterv2's map_item(), but there is quite a bit more data in the TCAT export that we could include.
    opened by stijn-uva 6
  • Update 'FAQ' and 'About' pages

    The 'About' page should probably refer to documentation and guides etc rather than the 'news' thing it's doing now, and the FAQ is still very 4chan-oriented.

    enhancement (mostly) front-end 
    opened by stijn-uva 0
  • Feature request: allow data from linked telegram chat channels to be collected

    Telegram chats have linked "discussion" channels, where users can respond to messages in the main channel. Occasionally, these are also public, and if so, can also be found by the API. It would be useful to allow users to also automatically collect data from these chat channels if they're found.
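
    For reference, a hedged Telethon sketch of discovering a channel's linked discussion group (4CAT's Telegram data source is built on Telethon; how this would slot into the collector is an assumption, and the credentials and channel name are placeholders):

        # Hedged Telethon sketch; api_id, api_hash and the channel are placeholders.
        from telethon.sync import TelegramClient
        from telethon.tl.functions.channels import GetFullChannelRequest
        from telethon.tl.types import PeerChannel

        api_id, api_hash = 12345, "0123456789abcdef"  # placeholder credentials

        with TelegramClient("session", api_id, api_hash) as client:
            full = client(GetFullChannelRequest("some_public_channel"))
            linked_id = full.full_chat.linked_chat_id  # None if no linked chat
            if linked_id:
                # also collect messages from the linked discussion channel
                for message in client.iter_messages(PeerChannel(linked_id), limit=100):
                    print(message.id, message.text)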

    A note on this and future feature requests: we're (https://github.com/GateNLP) putting in some additions to the telegram data collector on our end. Thought it might be worth checking if there's scope for them to be added to the original/main instance.

    If there are any issues with this, or if they don't really fit with what you have in mind for your instance, that's all fine; we'll continue to maintain them on our own fork instead!

    Linked pull request: https://github.com/digitalmethodsinitiative/4cat/pull/322

    enhancement data source 
    opened by muneerahp 1
  • LIHKG data source

    A data source for LIHKG. It uses the web interface's web API, which seems reasonably straightforward and stable. There is some rate limiting, which 4CAT tries to respect by pacing requests and implementing an exponential backoff; see the sketch below.
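
    A generic sketch of that pacing-and-backoff approach (the URL is a placeholder; 4CAT's actual implementation may differ):

        # Generic sketch of paced requests with exponential backoff.
        import time

        import requests

        def fetch_with_backoff(url, max_retries=5, pace=1.0):
            delay = pace
            for attempt in range(max_retries):
                response = requests.get(url)
                if response.status_code != 429:  # not rate-limited
                    return response
                time.sleep(delay)  # wait before retrying
                delay *= 2         # double the wait: exponential backoff
            raise RuntimeError("still rate-limited after %d retries" % max_retries)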

    enhancement data source questionable 
    opened by stijn-uva 0
  • ability to count frequency for specific (and multiple) keywords over time

    A processor that can filter on multiple particular words or phrases within a dataset and output the count values (overall, or over time) per item, producing a .csv that can be imported into RAWGraphs to compare the evolution of different words/phrases over time, in either absolute or relative numbers. A standalone sketch of the idea follows below.
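
    A standalone sketch of what such a processor could compute, assuming a 4CAT CSV export with 'timestamp' and 'body' columns (column and file names are assumptions):

        # Standalone sketch; "dataset.csv" and its column names are assumptions.
        import csv
        from datetime import datetime

        keywords = ["keyword1", "keyword2"]
        counts = {}  # (month, keyword) -> number of items containing the keyword

        with open("dataset.csv", newline="", encoding="utf-8") as infile:
            for row in csv.DictReader(infile):
                month = datetime.fromtimestamp(int(row["timestamp"])).strftime("%Y-%m")
                body = row["body"].lower()
                for keyword in keywords:
                    if keyword in body:
                        counts[(month, keyword)] = counts.get((month, keyword), 0) + 1

        # long-format CSV, one row per month/keyword, ready for RAWGraphs
        with open("keyword-counts.csv", "w", newline="", encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            writer.writerow(["date", "keyword", "count"])
            for (month, keyword), count in sorted(counts.items()):
                writer.writerow([month, keyword, count])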

    processors data source 
    opened by daniel-dezeeuw 0
  • Warn about need to update Docker `.env` file when upgrading 4CAT to new version

    When using Docker, the .env file can be used to ensure you pull a particular version of 4CAT. If you then upgrade 4CAT interactively, we cannot modify the .env file (which exists on the user's host machine). If a user removes or rebuilds 4CAT, it will pull the version of 4CAT listed in the .env file, which will not be the latest version that was upgraded to.

    I will look at adding a warning/notification to the upgrade logs to notify users of the need to update their .env file.

    enhancement deployment 
    opened by dale-wahl 0
Releases (v1.29)
  • v1.29 (Oct 6, 2022)

    Snapshot of 4CAT as of October 2022. Many changes and fixes since the last official release, including:

    • Restart and upgrade 4CAT via the web interface (#181, #287, #288)
    • Addition of several processors for Twitter datasets to increase inter-operability with DMI-TCAT
    • DMI-TCAT data source, which can interface with a DMI-TCAT instance to create datasets from tweets stored therein (#226)
    • LinkedIn data source, to be used together with Zeeschuimer
    • Fixes & improvements to Docker container set-up and build process (#269, #270, #290)
    • A number of processors have been updated to transparently filter NDJSON datasets instead of turning them into CSV datasets (#253, #282, #291, #292)
    • And many smaller fixes & updates

    From this release onwards, 4CAT can be upgraded to the latest release via the Control Panel in the web interface.

  • v1.26 (May 10, 2022)

    Many updates:

    • Configuration is now stored in the database and (mostly) editable via the web GUI
    • The Telegram datasource now collects more data and stores the 'raw' message objects as NDJSON
    • Dialogs in the web UI now use custom widgets instead of alert()
    • Twitter datasets will retrieve the expected number of tweets before capture begins, and ask for confirmation if it is a high number
    • Various fixes and tweaks to the Dockerfiles
    • New extended data source information pages with details about limitations, caveats, useful links, etc
    • And much more
  • v1.25 (Feb 24, 2022)

    Snapshot of 4CAT as of 24 February 2022. Many changes and fixes since the last official release, including:

    • Explore and annotate your datasets interactively with the new Explorer (beta)
    • Datasets can be set to automatically get deleted after a set amount of time, and can be made private
    • Incremental refinement of the web interface
    • Twitter datasets can be exported to a DMI-TCAT instance
    • User accounts can now be deactivated (banned)
    • Many smaller fixes and new features
  • v1.21 (Sep 28, 2021)

    Snapshot of 4CAT as of 28 September 2021. Many changes and fixes since the last official release, including:

    • User management via control panel
    • Improved Docker support
    • Improved 4chan data dump import helper scripts
    • Improved country code filtering for 4chan/pol/ datasets
    • More robust and versatile network analysis processors
    • Various new filter processors
    • Topic modeling processor
    • Support for non-academic Twitter API queries
    • Option to download NDJSON datasets as CSV
    • Support for hosting 4CAT with a non-root URL
    • And many more
  • v1.18a (May 7, 2021)

  • v1.17 (Apr 8, 2021)

  • v1.9b1 (Jan 17, 2020)

  • v1.0b1 (Feb 28, 2019)

    4CAT is now ready for wider use! It offers...

    • An API that can be used to queue and manipulate queries programmatically
    • Diverse analytical post-processors that may be combined to further analyse data sets
    • A flexible interface for adding various data sources
    • A robust scraper
    • A very retro interface
Owner
Digital Methods Initiative
The Digital Methods Initiative (DMI) is one of Europe's leading Internet Studies research groups. Research tools it develops are collected here.