{
"cells": [
{
"cell_type": "markdown",
"id": "7f1f5a4a-ae34-4a27-9efa-399edc0e384a",
"metadata": {
"tags": []
},
"source": [
"## Benchmark: ClickHouse Vs. InfluxDB Vs. Postgresql Vs. Parquet \n",
"\n",
"-----\n",
"\n",
"#### How to use:\n",
"* Rename the file \"properties-model.ini\" to \"properties.ini\"\n",
"* Fill with your own credentials\n",
"----\n",
"\n",
"The proposal of this work is to compare the speed in read/writing a midle level of data ( a dataset with 9 columns and 50.000 lines) to four diferent databases:\n",
"* ClickHouse\n",
"* InfluxDB\n",
"* Postgresql\n",
"* Parquet (in a S3 Minio Storage)
\n",
"ToDo:
\n",
"* DuckDB with Polars\n",
"* MongoDB\n",
"* Kdb+\n",
"\n",
" \n",
"Deve-se relevar:\n",
"é uma \"cold-storage\" ou \"frezze-storage\"?
\n",
"influxdb: alta leitura e possui a vantagem da indexaçõa para vizualização de dados em gráficos.\n",
"\n",
"notas: \n",
"* comparar tamanho do csv com parquet"
]
},
{
"cell_type": "markdown",
"id": "6bb26ce7-1e84-4665-accd-916bb977f95d",
"metadata": {
"tags": []
},
"source": [
"### Imports "
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "ab6c6c81-6ac1-4668-a79b-a9a0341fb35a",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import configparser\n",
"from datetime import datetime\n",
"\n",
"import duckdb\n",
"import influxdb_client\n",
"import pandas as pd\n",
"\n",
"# import pymongo\n",
"from clickhouse_driver import Client\n",
"from dotenv import load_dotenv\n",
"from minio import Minio\n",
"from pymongo import MongoClient\n",
"from pytz import timezone\n",
"from sqlalchemy import create_engine\n",
"\n",
"load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55c3cd57-0996-4723-beb5-8f3196c96009",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Variables\n",
"dbname = \"EURUSDtest\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "968403e3-2e5e-4834-b969-be4600e2963a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"arq = configparser.RawConfigParser()\n",
"arq.read(\"properties.ini\")\n",
"ClickHouseUser = arq.get(\"CLICKHOUSE\", \"user\")\n",
"ClickHouseKey = arq.get(\"CLICKHOUSE\", \"key\")\n",
"ClickHouseUrl = arq.get(\"CLICKHOUSE\", \"url\")\n",
"\n",
"InfluxDBUser = arq.get(\"INFLUXDB\", \"user\")\n",
"InfluxDBKey = arq.get(\"INFLUXDB\", \"key\")\n",
"InfluxDBUrl = arq.get(\"INFLUXDB\", \"url\")\n",
"InfluxDBBucket = arq.get(\"INFLUXDB\", \"bucket\")\n",
"\n",
"PostgresqlUser = arq.get(\"POSTGRESQL\", \"user\")\n",
"PostgresqlKey = arq.get(\"POSTGRESQL\", \"key\")\n",
"PostgresqlUrl = arq.get(\"POSTGRESQL\", \"url\")\n",
"PostgresqlDB = arq.get(\"POSTGRESQL\", \"database\")\n",
"\n",
"S3MinioUser = arq.get(\"S3MINIO\", \"user\")\n",
"S3MinioKey = arq.get(\"S3MINIO\", \"key\")\n",
"S3MinioUrl = arq.get(\"S3MINIO\", \"url\")\n",
"S3MinioRegion = arq.get(\"S3MINIO\", \"region\")\n",
"\n",
"MongoUser = arq.get(\"MONGODB\", \"user\")\n",
"MongoKey = arq.get(\"MONGODB\", \"key\")\n",
"MongoUrl = arq.get(\"MONGODB\", \"url\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3634a4ec-04c2-4f1e-8659-5d22eb17a254",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"# Load Dataset\n",
"df = pd.read_csv(\"out.csv\", index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e7c46b6-90ee-4ca3-8b5a-553b09ece913",
"metadata": {},
"outputs": [],
"source": [
"# df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76199f91-31d6-416b-9f15-5d435b3792c9",
"metadata": {},
"outputs": [],
"source": [
"df[\"from\"] = pd.to_datetime(df[\"from\"], unit=\"s\")\n",
"df[\"to\"] = pd.to_datetime(df[\"to\"], unit=\"s\")\n",
"# Optional use when not transoformed yet\n",
"# Transform Datetime"
]
},
{
"cell_type": "markdown",
"id": "274cc026-2f48-4e38-b80f-b1a9ff982060",
"metadata": {
"tags": []
},
"source": [
"#### Funçoes\n",
"\n",
"-> Class"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27de1ec8-4de1-440a-b555-b4a46c5ef7ce",
"metadata": {},
"outputs": [],
"source": [
"def timestamp2dataHora(x, timezone_=\"America/Sao_Paulo\"):\n",
" d = datetime.fromtimestamp(x, tz=timezone(timezone_))\n",
" return d"
]
},
{
"cell_type": "markdown",
"id": "4a8d5703-9bc9-4d38-83ff-457159304d58",
"metadata": {
"tags": []
},
"source": [
"### ClickHouse"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cf86669-7722-4a2c-895c-51f0a5eebefc",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# !! O client oficial usa um driver http, nesse exemplo vamos usar a biblioteca\n",
"# de terceirtos clickhouse_driver recomendada, por sua vez que usa tcp.\n",
"client = Client(\n",
" host=ClickHouseUrl,\n",
" user=ClickHouseUser,\n",
" password=ClickHouseKey,\n",
" settings={\"use_numpy\": True},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0a1f67b-2e63-462e-be66-d322d99837ea",
"metadata": {},
"outputs": [],
"source": [
"# Create Tables in ClickHouse\n",
"# !! ALTERAR TIPOS !!\n",
"# ENGINE: 'Memory' desaparece quando server é reiniciado\n",
"client.execute(\n",
" \"CREATE TABLE IF NOT EXISTS {} (id UInt32,\"\n",
" \"from DateTime, at UInt64, to DateTime, open Float64,\"\n",
" \"close Float64, min Float64, max Float64, volume UInt32)\"\n",
" \"ENGINE MergeTree ORDER BY to\".format(dbname)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a029a09-46f4-43c3-b3df-cfbed33fb0dc",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"# Write dataframe to db\n",
"client.insert_dataframe(\"INSERT INTO {} VALUES\".format(dbname), df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17251288-2442-43ee-98f2-ca680c3c4f13",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"client.query_dataframe(\"SELECT * FROM default.{}\".format(dbname)) # LIMIT 10000"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51497522-bd6c-44a8-aaea-ec5dda30b95b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# %%time\n",
"# df = pd.DataFrame(client.query_dataframe(\"SELECT * FROM default.{}\".format(dbname)))"
]
},
{
"cell_type": "markdown",
"id": "1d389546-911f-43f7-aad1-49f7bcc83503",
"metadata": {
"tags": []
},
"source": [
"### InfluxDB\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3e7ebfd-76f1-4ac4-9833-312eb1a531af",
"metadata": {},
"outputs": [],
"source": [
"client = influxdb_client.InfluxDBClient(\n",
" url=InfluxDBUrl, token=InfluxDBKey, org=InfluxDBUser\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbf61f12-830b-4c57-804a-2257d8b3599a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Read data from CSV without index and parse 'TimeStamp' as date.\n",
"df = pd.read_csv(\"out.csv\", sep=\",\", index_col=False, parse_dates=[\"from\"])\n",
"# Set 'TimeStamp' field as index of dataframe # test another indexs\n",
"df.set_index(\"from\", inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54342a28-ba2b-4ade-a692-00566b53a639",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f861fab2-f1b1-49dd-b758-12d10aef3462",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"# gravando... demorou... mas deu certo\n",
"with client.write_api() as writer:\n",
" writer.write(\n",
" bucket=InfluxDBBucket,\n",
" record=df,\n",
" data_frame_measurement_name=\"id\",\n",
" data_frame_tag_columns=[\"volume\"],\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0bb2563d-68e2-4ff4-8842-70ac730dc6b1",
"metadata": {},
"outputs": [],
"source": [
"# data\n",
"# |> pivot(\n",
"# rowKey:[\"_time\"],\n",
"# columnKey: [\"_field\"],\n",
"# valueColumn: \"_value\"\n",
"# )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb1596f9-4cee-4642-803a-ee61c9dddf64",
"metadata": {},
"outputs": [],
"source": [
"# Read"
]
},
{
"cell_type": "markdown",
"id": "b9ddfdc6-c899-4f6c-9b4e-8ec6ab6d7e05",
"metadata": {
"tags": []
},
"source": [
"### Postgresql"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16cd8eb7-333d-43fd-88e0-ee983645d3fd",
"metadata": {},
"outputs": [],
"source": [
"# Connect / Create Tables\n",
"engine = create_engine(\n",
" \"postgresql+psycopg2://{}:{}@{}:5432/{}\".format(\n",
" PostgresqlUser, PostgresqlKey, PostgresqlUrl, PostgresqlDB\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be31f3a0-b7ed-48e6-9b65-dc16319fb8d1",
"metadata": {},
"outputs": [],
"source": [
"# Drop old table and create new empty table\n",
"df.head(0).to_sql(\"comparedbs\", engine, if_exists=\"replace\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7883c4d-4609-4380-8a45-246b7ca2f9c5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"# Write\n",
"conn = engine.raw_connection()\n",
"cur = conn.cursor()\n",
"output = io.StringIO()\n",
"df.to_csv(output, sep=\"\\t\", header=False, index=False)\n",
"output.seek(0)\n",
"contents = output.getvalue()\n",
"\n",
"cur.copy_from(output, \"comparedbs\") # , null=\"\") # null values become ''\n",
"conn.commit()\n",
"cur.close()\n",
"conn.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e37a93e1-fc0e-4d27-9e16-dca6c8aea324",
"metadata": {},
"outputs": [],
"source": [
"# Read"
]
},
{
"cell_type": "markdown",
"id": "f9e0393d-7d1d-406a-a068-9dbf4968e977",
"metadata": {
"tags": []
},
"source": [
"### S3 Parquet"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60a990e2-4607-4654-84ec-17d4985adae2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# fazer sem funçao para ver se melhora\n",
"# verifique se esta no ssd os arquivos da pasta git\n",
"def main():\n",
" client = Minio(\n",
" S3MinioUrl,\n",
" secure=False,\n",
" region=S3MinioRegion,\n",
" access_key=\"MatMPA7NyHltz7DQ\",\n",
" secret_key=\"SO1IG5iBPSjNPZanYUaHCLcoSbjphLCP\",\n",
" )\n",
"\n",
" # Make bucket if not exist.\n",
" found = client.bucket_exists(\"data\")\n",
" if not found:\n",
" client.make_bucket(\"data\")\n",
" else:\n",
" print(\"Bucket 'data' already exists\")\n",
"\n",
" # Upload\n",
" client.fput_object(\n",
" \"data\",\n",
" \"data.parquet\",\n",
" \"data/data.parquet\",\n",
" )\n",
" # print(\n",
" # \"'data/data.parquet' is successfully uploaded as \"\n",
" # \"object 'data.parquet' to bucket 'data'.\"\n",
" # )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "390918c8-c88f-404a-96c4-685d578fdad0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"df.to_parquet(\"data/data.parquet\")\n",
"if __name__ == \"__main__\":\n",
" try:\n",
" main()\n",
" except S3Error as exc:\n",
" print(\"error occurred.\", exc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9e07143-8c11-4b68-a869-c3922cda9092",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pq = pd.read_parquet(\"data/data.parquet\", engine=\"pyarrow\")\n",
"pq.head()"
]
},
{
"cell_type": "markdown",
"id": "50d1fc58-89a7-4507-aff0-6e943656cfe0",
"metadata": {
"tags": []
},
"source": [
"### MongoDB"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d104d9af-fa34-4261-8478-329a28ee4f2e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Load csv dataset\n",
"data = pd.read_csv(\"out.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0af8f72c-5b58-4dfc-af36-c5b4bc79f127",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Connect to MongoDB\n",
"client = MongoClient(\n",
" # \"mongodb://192.168.1.133:27017\"\n",
" \"mongodb://{}:{}@{}/EURUSDtest?retryWrites=true&w=majority\".format(\n",
" MongoUser, MongoKey, MongoUrl\n",
" ),\n",
" authSource=\"admin\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1b20d15-f5af-463c-813f-ffae61119de1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"db = client[\"EUROUSDtest\"]\n",
"collection = db[\"finance\"]\n",
"# data.reset_index(inplace=True)\n",
"data_dict = data.to_dict(\"records\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70674d23-f375-4659-87ec-c745dec96d54",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"# Insert collection\n",
"collection.insert_many(data_dict)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81a4a33d-5914-45d8-af4e-2b0aabd2ac38",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# read"
]
},
{
"cell_type": "markdown",
"id": "97405e42-61dc-42c7-8220-237a312c0ec7",
"metadata": {
"tags": []
},
"source": [
"### DuckDB"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbcdb883-d6dc-46db-88db-4c90b84522ba",
"metadata": {},
"outputs": [],
"source": [
"cursor = duckdb.connect()\n",
"print(cursor.execute(\"SELECT 42\").fetchall())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35025a6e-9dc7-46cf-a792-76b3d84f1ac0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"conn = duckdb.connect()\n",
"data = pd.read_csv(\"out.csv\")\n",
"conn.register(\"EURUSDtest\", data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6abdaaa-3ac2-425b-9208-d6cb79afe966",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"display(conn.execute(\"SHOW TABLES\").df())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2acce0f3-f0b2-47d0-8e0d-f9e9687efc18",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"df = conn.execute(\"SELECT * FROM EURUSDtest\").df()\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "4409cc89-ed14-4313-ac89-65b826038533",
"metadata": {
"tags": []
},
"source": [
"### Kdb+"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "14f63810-1943-4e28-9bce-2148be6be02d",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"np.bool = np.bool_\n",
"from qpython import qconnection"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "8ff6c090-7e02-435a-a179-f2aab81da972",
"metadata": {},
"outputs": [],
"source": [
"# read csv\n",
"data = pd.read_csv(\"out.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "b4eb8ab9-81e8-4732-8cf7-51f0981d3d57",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# open connection\n",
"q = qconnection.QConnection(host=\"localhost\", port=5001)\n",
"q.open()"
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "97cb6b5b-65a5-46a0-a4ee-e5c535a716ab",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 925 ms, sys: 40 ms, total: 965 ms\n",
"Wall time: 1.43 s\n"
]
}
],
"source": [
"%%time\n",
"# send df to kd+ in memory bank\n",
"q.sendSync(\"{t::x}\", data)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "c2ed2d51-bc8e-4207-892a-35fc55d43570",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b':/home/sandman/q/tab1'"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# write to on disk table\n",
"q.sendSync(\"`:/home/sandman/q/tab1 set t\")"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "9c055a95-f73f-43a3-8fbd-61e42235117e",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.94 ms, sys: 1 µs, total: 1.94 ms\n",
"Wall time: 426 ms\n"
]
}
],
"source": [
"%%time\n",
"# read from on disk table\n",
"df2 = q.sendSync(\"tab2: get `:/home/sandman/q/tab1\")"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "9760de38-9f04-4322-bfff-c7ee12d5dee5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# print(df2)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "c06c9222-c69d-4872-9d21-052281a013e2",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.08 s, sys: 116 ms, total: 1.2 s\n",
"Wall time: 1.27 s\n"
]
}
],
"source": [
"%%time\n",
"# load to variable df2\n",
"df2 = q.sendSync(\"tab2\")"
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "8815f01c-fd0a-4f94-ab7f-f8ede84ba4e7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# df2(type)"
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "e6ed3927-4395-45cd-9a28-88c5db01f2e5",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.25 s, sys: 132 ms, total: 1.39 s\n",
"Wall time: 1.46 s\n"
]
},
{
"data": {
"text/html": [
"
| | Unnamed: 0 | id | from | at | to | open | close | min | max | volume |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7730801 | b'2023-01-02 15:58:45' | 1672675140000000000 | b'2023-01-02 15:59:00' | 1.065995 | 1.066035 | 1.065930 | 1.066070 | 57 |
| 1 | 1 | 7730802 | b'2023-01-02 15:59:00' | 1672675155000000000 | b'2023-01-02 15:59:15' | 1.066055 | 1.066085 | 1.066005 | 1.066115 | 52 |
| 2 | 2 | 7730803 | b'2023-01-02 15:59:15' | 1672675170000000000 | b'2023-01-02 15:59:30' | 1.066080 | 1.066025 | 1.066025 | 1.066110 | 57 |
| 3 | 3 | 7730804 | b'2023-01-02 15:59:30' | 1672675185000000000 | b'2023-01-02 15:59:45' | 1.065980 | 1.065985 | 1.065885 | 1.066045 | 64 |
| 4 | 4 | 7730805 | b'2023-01-02 15:59:45' | 1672675200000000000 | b'2023-01-02 16:00:00' | 1.065975 | 1.066055 | 1.065830 | 1.066055 | 50 |