{
"cells": [
{
"cell_type": "markdown",
"id": "7f1f5a4a-ae34-4a27-9efa-399edc0e384a",
"metadata": {
"tags": []
},
"source": [
"## Benchmark: ClickHouse vs. InfluxDB vs. PostgreSQL vs. Parquet\n",
"\n",
"-----\n",
"\n",
"#### How to use:\n",
"* Rename the file \"properties-model.ini\" to \"properties.ini\"\n",
"* Fill in your own credentials\n",
"----\n",
"\n",
"The purpose of this work is to compare read and write speed for a medium-sized dataset (9 columns and 50,000 rows) across four different data stores:\n",
"* ClickHouse\n",
"* InfluxDB\n",
"* PostgreSQL\n",
"* Parquet (in S3 MinIO storage)\n",
"\n",
"ToDo:\n",
"\n",
"* DuckDB with Polars\n",
"* MongoDB\n",
"* Kdb+\n",
"\n",
" \n",
"Points to consider:\n",
"\n",
"Is this \"cold storage\" or \"freeze storage\"?\n",
"\n",
"InfluxDB: high read performance, with the advantage of indexing for visualizing data in charts.\n",
"\n",
"Notes:\n",
"* Compare the size of the CSV with the Parquet file"
]
},
{
"cell_type": "markdown",
"id": "6bb26ce7-1e84-4665-accd-916bb977f95d",
"metadata": {
"tags": []
},
"source": [
"### Imports "
]
},
{
"cell_type": "code",
"execution_count": 74,
"id": "ab6c6c81-6ac1-4668-a79b-a9a0341fb35a",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import configparser\n",
"import io\n",
"from datetime import datetime\n",
"\n",
"import influxdb_client\n",
"import pandas as pd\n",
"from clickhouse_driver import Client\n",
"from dotenv import load_dotenv\n",
"from minio import Minio\n",
"from minio.error import S3Error\n",
"from pymongo import MongoClient\n",
"from pytz import timezone\n",
"from sqlalchemy import create_engine\n",
"\n",
"# Load environment variables from a local .env file (pip install python-dotenv)\n",
"load_dotenv()\n",
"\n",
"\n",
"# Optional imports kept for experiments:\n",
"# import time\n",
"# import numpy as np\n",
"# import clickhouse_connect\n",
"# import psycopg2\n",
"# import os\n",
"# import pyarrow as pa\n",
"# import pyarrow.parquet as pq\n",
"# import s3fs\n",
"# from friendly.jupyter import Friendly\n",
"# from pyarrow import Table\n",
"# from influxdb_client import InfluxDBClient, Point, WritePrecision\n",
"# from influxdb_client.client.write_api import SYNCHRONOUS\n",
"# Friendly.dark()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01d88282-32a1-404f-92da-488a23302fd0",
"metadata": {},
"outputs": [],
"source": [
"# test"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55c3cd57-0996-4723-beb5-8f3196c96009",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Variables\n",
"dbname = \"EURUSDtest\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "968403e3-2e5e-4834-b969-be4600e2963a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"arq = configparser.RawConfigParser()\n",
"arq.read(\"properties.ini\")\n",
"ClickHouseUser = arq.get(\"CLICKHOUSE\", \"user\")\n",
"ClickHouseKey = arq.get(\"CLICKHOUSE\", \"key\")\n",
"ClickHouseUrl = arq.get(\"CLICKHOUSE\", \"url\")\n",
"\n",
"InfluxDBUser = arq.get(\"INFLUXDB\", \"user\")\n",
"InfluxDBKey = arq.get(\"INFLUXDB\", \"key\")\n",
"InfluxDBUrl = arq.get(\"INFLUXDB\", \"url\")\n",
"InfluxDBBucket = arq.get(\"INFLUXDB\", \"bucket\")\n",
"\n",
"PostgresqlUser = arq.get(\"POSTGRESQL\", \"user\")\n",
"PostgresqlKey = arq.get(\"POSTGRESQL\", \"key\")\n",
"PostgresqlUrl = arq.get(\"POSTGRESQL\", \"url\")\n",
"PostgresqlDB = arq.get(\"POSTGRESQL\", \"database\")\n",
"\n",
"S3MinioUser = arq.get(\"S3MINIO\", \"user\")\n",
"S3MinioKey = arq.get(\"S3MINIO\", \"key\")\n",
"S3MinioUrl = arq.get(\"S3MINIO\", \"url\")\n",
"S3MinioRegion = arq.get(\"S3MINIO\", \"region\")\n",
"\n",
"MongoUser = arq.get(\"MONGODB\", \"user\")\n",
"MongoKey = arq.get(\"MONGODB\", \"key\")\n",
"MongoUrl = arq.get(\"MONGODB\", \"url\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3634a4ec-04c2-4f1e-8659-5d22eb17a254",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"# Load Dataset\n",
"df = pd.read_csv(\"out.csv\", index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e7c46b6-90ee-4ca3-8b5a-553b09ece913",
"metadata": {},
"outputs": [],
"source": [
"# df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76199f91-31d6-416b-9f15-5d435b3792c9",
"metadata": {},
"outputs": [],
"source": [
"# Transform datetime columns\n",
"# (optional: only needed when the columns are still Unix timestamps in seconds)\n",
"df[\"from\"] = pd.to_datetime(df[\"from\"], unit=\"s\")\n",
"df[\"to\"] = pd.to_datetime(df[\"to\"], unit=\"s\")"
]
},
{
"cell_type": "markdown",
"id": "274cc026-2f48-4e38-b80f-b1a9ff982060",
"metadata": {
"tags": []
},
"source": [
"#### Functions\n",
"\n",
"-> Class"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27de1ec8-4de1-440a-b555-b4a46c5ef7ce",
"metadata": {},
"outputs": [],
"source": [
"def timestamp2dataHora(x, timezone_=\"America/Sao_Paulo\"):\n",
"    # Convert a Unix timestamp (seconds) to a timezone-aware datetime\n",
"    d = datetime.fromtimestamp(x, tz=timezone(timezone_))\n",
"    return d"
]
},
{
"cell_type": "markdown",
"id": "4a8d5703-9bc9-4d38-83ff-457159304d58",
"metadata": {
"tags": []
},
"source": [
"### ClickHouse"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cf86669-7722-4a2c-895c-51f0a5eebefc",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# !! The official client uses an HTTP driver; in this example we use the\n",
"# recommended third-party library clickhouse_driver, which uses TCP.\n",
"client = Client(\n",
"    host=ClickHouseUrl,\n",
"    user=ClickHouseUser,\n",
"    password=ClickHouseKey,\n",
"    settings={\"use_numpy\": True},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0a1f67b-2e63-462e-be66-d322d99837ea",
"metadata": {},
"outputs": [],
"source": [
"# Create tables in ClickHouse\n",
"# !! TODO: adjust column types !!\n",
"# ENGINE: 'Memory' tables disappear when the server is restarted\n",
"# `from` and `to` are backtick-quoted because they are SQL keywords\n",
"client.execute(\n",
"    \"CREATE TABLE IF NOT EXISTS {} (id UInt32, \"\n",
"    \"`from` DateTime, at UInt64, `to` DateTime, open Float64, \"\n",
"    \"close Float64, min Float64, max Float64, volume UInt32) \"\n",
"    \"ENGINE MergeTree ORDER BY `to`\".format(dbname)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a029a09-46f4-43c3-b3df-cfbed33fb0dc",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"# Write dataframe to db\n",
"client.insert_dataframe(\"INSERT INTO {} VALUES\".format(dbname), df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17251288-2442-43ee-98f2-ca680c3c4f13",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"client.query_dataframe(\"SELECT * FROM default.{}\".format(dbname)) # LIMIT 10000"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51497522-bd6c-44a8-aaea-ec5dda30b95b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"df = pd.DataFrame(client.query_dataframe(\"SELECT * FROM default.{}\".format(dbname)))"
]
},
{
"cell_type": "markdown",
"id": "1d389546-911f-43f7-aad1-49f7bcc83503",
"metadata": {
"tags": []
},
"source": [
"### InfluxDB\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3e7ebfd-76f1-4ac4-9833-312eb1a531af",
"metadata": {},
"outputs": [],
"source": [
"client = influxdb_client.InfluxDBClient(\n",
" url=InfluxDBUrl, token=InfluxDBKey, org=InfluxDBUser\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbf61f12-830b-4c57-804a-2257d8b3599a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Read data from CSV without an index and parse 'from' as a datetime.\n",
"df = pd.read_csv(\"out.csv\", sep=\",\", index_col=False, parse_dates=[\"from\"])\n",
"# Set the 'from' field as the index of the dataframe  # TODO: test other indexes\n",
"df.set_index(\"from\", inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54342a28-ba2b-4ade-a692-00566b53a639",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f861fab2-f1b1-49dd-b758-12d10aef3462",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"# Writing... it took a while, but it worked\n",
"with client.write_api() as writer:\n",
"    writer.write(\n",
"        bucket=InfluxDBBucket,\n",
"        record=df,\n",
"        data_frame_measurement_name=\"id\",\n",
"        data_frame_tag_columns=[\"volume\"],\n",
"    )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0bb2563d-68e2-4ff4-8842-70ac730dc6b1",
"metadata": {},
"outputs": [],
"source": [
"# data\n",
"# |> pivot(\n",
"# rowKey:[\"_time\"],\n",
"# columnKey: [\"_field\"],\n",
"# valueColumn: \"_value\"\n",
"# )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb1596f9-4cee-4642-803a-ee61c9dddf64",
"metadata": {},
"outputs": [],
"source": [
"# Read"
]
},
{
"cell_type": "markdown",
"id": "b9ddfdc6-c899-4f6c-9b4e-8ec6ab6d7e05",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"### PostgreSQL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16cd8eb7-333d-43fd-88e0-ee983645d3fd",
"metadata": {},
"outputs": [],
"source": [
"# Connect / Create Tables\n",
"engine = create_engine(\n",
" \"postgresql+psycopg2://{}:{}@{}:5432/{}\".format(\n",
" PostgresqlUser, PostgresqlKey, PostgresqlUrl, PostgresqlDB\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be31f3a0-b7ed-48e6-9b65-dc16319fb8d1",
"metadata": {},
"outputs": [],
"source": [
"# Drop old table and create new empty table\n",
"df.head(0).to_sql(\"comparedbs\", engine, if_exists=\"replace\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7883c4d-4609-4380-8a45-246b7ca2f9c5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"# Write: bulk-load via COPY from an in-memory, tab-separated buffer\n",
"conn = engine.raw_connection()\n",
"cur = conn.cursor()\n",
"output = io.StringIO()\n",
"df.to_csv(output, sep=\"\\t\", header=False, index=False)\n",
"output.seek(0)\n",
"\n",
"cur.copy_from(output, \"comparedbs\")  # , null=\"\")  # null values become ''\n",
"conn.commit()\n",
"cur.close()\n",
"conn.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e37a93e1-fc0e-4d27-9e16-dca6c8aea324",
"metadata": {},
"outputs": [],
"source": [
"# Read"
]
},
{
"cell_type": "markdown",
"id": "f9e0393d-7d1d-406a-a068-9dbf4968e977",
"metadata": {
"tags": []
},
"source": [
"### S3 Parquet"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "60a990e2-4607-4654-84ec-17d4985adae2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# TODO: try without a function to see whether it gets faster\n",
"# Check whether the files in the git folder are on the SSD\n",
"def main():\n",
"    # Credentials come from properties.ini rather than being hardcoded\n",
"    client = Minio(\n",
"        S3MinioUrl,\n",
"        secure=False,\n",
"        region=S3MinioRegion,\n",
"        access_key=S3MinioUser,\n",
"        secret_key=S3MinioKey,\n",
"    )\n",
"\n",
"    # Make the bucket if it does not exist.\n",
"    found = client.bucket_exists(\"data\")\n",
"    if not found:\n",
"        client.make_bucket(\"data\")\n",
"    else:\n",
"        print(\"Bucket 'data' already exists\")\n",
"\n",
"    # Upload\n",
"    client.fput_object(\n",
"        \"data\",\n",
"        \"data.parquet\",\n",
"        \"data/data.parquet\",\n",
"    )\n",
"    # print(\n",
"    #     \"'data/data.parquet' is successfully uploaded as \"\n",
"    #     \"object 'data.parquet' to bucket 'data'.\"\n",
"    # )"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "390918c8-c88f-404a-96c4-685d578fdad0",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bucket 'data' already exists\n",
"CPU times: user 610 ms, sys: 133 ms, total: 743 ms\n",
"Wall time: 4.05 s\n"
]
}
],
"source": [
"%%time\n",
"df.to_parquet(\"data/data.parquet\")\n",
"if __name__ == \"__main__\":\n",
"    try:\n",
"        main()\n",
"    except S3Error as exc:\n",
"        print(\"error occurred.\", exc)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "a9e07143-8c11-4b68-a869-c3922cda9092",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Unnamed: 0</th><th>id</th><th>at</th><th>to</th><th>open</th><th>close</th><th>min</th><th>max</th><th>volume</th></tr>\n",
"    <tr><th>from</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2023-01-02 15:58:45</th><td>0</td><td>7730801</td><td>1672675140000000000</td><td>2023-01-02 15:59:00</td><td>1.065995</td><td>1.066035</td><td>1.065930</td><td>1.066070</td><td>57</td></tr>\n",
"    <tr><th>2023-01-02 15:59:00</th><td>1</td><td>7730802</td><td>1672675155000000000</td><td>2023-01-02 15:59:15</td><td>1.066055</td><td>1.066085</td><td>1.066005</td><td>1.066115</td><td>52</td></tr>\n",
"    <tr><th>2023-01-02 15:59:15</th><td>2</td><td>7730803</td><td>1672675170000000000</td><td>2023-01-02 15:59:30</td><td>1.066080</td><td>1.066025</td><td>1.066025</td><td>1.066110</td><td>57</td></tr>\n",
"    <tr><th>2023-01-02 15:59:30</th><td>3</td><td>7730804</td><td>1672675185000000000</td><td>2023-01-02 15:59:45</td><td>1.065980</td><td>1.065985</td><td>1.065885</td><td>1.066045</td><td>64</td></tr>\n",
"    <tr><th>2023-01-02 15:59:45</th><td>4</td><td>7730805</td><td>1672675200000000000</td><td>2023-01-02 16:00:00</td><td>1.065975</td><td>1.066055</td><td>1.065830</td><td>1.066055</td><td>50</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"