#datawarehouse


Could data segregation help mitigate the impact of large-scale data incidents?

Looking at the Qantas breach of 6 million passenger records.

Taking a step back from the data warehouse model, what if data could be stored in different locations based on a set of criteria instead of in a single repository? Access to these systems could be isolated as well, so if one system got compromised it would not expose the entire data set.

The data could still be mined for business analytics, but it could be pseudonymized in the data warehouse. If access to the warehouse were compromised, privacy would not be impacted.
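A minimal sketch of the pseudonymization idea, assuming a keyed hash and made-up field names (not a real system design):

import hmac
import hashlib

# The secret key lives only in the isolated source system, never in the
# warehouse, so a warehouse compromise alone cannot reverse the tokens.
SECRET_KEY = b"kept-in-the-isolated-source-system"

def pseudonymize(value: str) -> str:
    # Deterministic keyed hash: the same input always yields the same
    # token, so joins and aggregations in the warehouse still work.
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

passenger = {"email": "jane@example.com", "flights_ytd": 14}

warehouse_row = {
    "passenger_token": pseudonymize(passenger["email"]),  # PII replaced
    "flights_ytd": passenger["flights_ytd"],              # analytics value kept
}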

This is a much more complex and expensive setup, but the cost could be weighed against the loss resulting from a compromise.

There is also the impact on real-time data interactions with PII: where it is stored, how it is accessed, etc. Lots of considerations.

Just a thought, though it may not be practical.

🚀 Join Our Free Demo on Snowflake Online Training!
Ready to upgrade your data skills with the cloud-based data platform trusted by top companies worldwide? Don’t miss our FREE Demo Session!
👉Attend Online #FreeDemo On #Snowflake by Mr. Krishna.
📅Demo on: 29th May, 2025 @ 7:00 AM (IST).
📲Contact us: +91 7032290546
🟢WhatsApp: wa.me/c/917032290546
🌐Visit: visualpath.in/snowflake-traini
📝 Blog: visualpathblogs.com/category/s

I’ve been working on a pretty gnarly data warehouse reporting problem for the past few days. It’s up-leveling my ability to do this kind of work. The tooling has always seemed so limited, and I am beginning to understand that it is me who is limited in my understanding of the tooling ecosystem.

There may or may not be a wonderful overlap of programming and data warehousing, but it’s clear that my not being aware of it doesn’t mean it doesn’t exist.

I just discovered that Snowflake (the company) got its name not because it makes for a beautiful snowflake logo, but because the snowflake schema is a pattern for storing information in data warehouses (alongside the star schema).
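A rough illustration of the difference, sketched as SQLAlchemy table definitions with invented names (a star schema keeps dimensions flat; a snowflake schema normalizes them into sub-dimensions):

from sqlalchemy import Column, ForeignKey, Integer, MetaData, Numeric, String, Table

metadata = MetaData()

# Star schema: one flat, denormalized dimension per axis of analysis.
dim_product_star = Table(
    "dim_product_star", metadata,
    Column("product_id", Integer, primary_key=True),
    Column("product_name", String),
    Column("category_name", String),  # category attributes stored inline
)

# Snowflake schema: the same dimension normalized into sub-dimensions,
# so the diagram "snowflakes" outward from the fact table.
dim_category = Table(
    "dim_category", metadata,
    Column("category_id", Integer, primary_key=True),
    Column("category_name", String),
)
dim_product_snow = Table(
    "dim_product_snow", metadata,
    Column("product_id", Integer, primary_key=True),
    Column("product_name", String),
    Column("category_id", Integer, ForeignKey("dim_category.category_id")),
)

# Either way, the fact table holds measures plus keys into the dimensions.
fact_sales = Table(
    "fact_sales", metadata,
    Column("sale_id", Integer, primary_key=True),
    Column("product_id", Integer, ForeignKey("dim_product_snow.product_id")),
    Column("amount", Numeric),
)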

An analysis of 100 Fortune 500 job postings reveals the tools and technologies shaping the data engineering field in 2025. Top skills in demand:
⁕ Programming Languages (196) - SQL (85), Python (76), Scala (14), Java (14)
⁕ ETL and Data Pipeline (136) - ETL (65), Data Integration (46)
⁕ Cloud Platforms (85) - AWS (45), GCP (26), Azure (14)
⁕ Data Modeling and Warehousing (83) - Data Modeling (40), Data Warehousing (22), Data Architecture (21)
⁕ Big Data Tools (67) - Spark (40), Big Data Tools (19), Hadoop (8)
⁕ DevOps, Version Control, and CI/CD (52) - Git (14), CI/CD (13), DevOps (7), Version Control (6), Terraform (6)
...

reddit.com/r/dataengineering/c

Part2:

I split all columns into strings and numerics by converting with the pandas function pd.to_numeric and checking whether errors occur.
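Roughly like this (a simplified sketch, not the full script):

import pandas as pd

df = pd.DataFrame({"a": ["1", "2"], "b": ["x", "y"]})  # sample data

numeric_cols, string_cols = [], []
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col], errors="raise")
        numeric_cols.append(col)          # conversion succeeded -> numeric
    except (ValueError, TypeError):
        string_cols.append(col)           # conversion failed -> keep as string

print(numeric_cols, string_cols)  # ['a'] ['b']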

In Power BI I load one table with date indexes for slices and create a second table with the latest slice.

SQLAlchemy type mapping:

from sqlalchemy import DateTime, Float, Integer, String

# Map pandas dtypes to SQLAlchemy column types
dtype_mapping = {
    'object': String,
    'float64': Float,
    'int64': Integer,
    'datetime64[ns]': DateTime,
    'datetime64': DateTime,
}
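Roughly how the mapping gets used with to_sql (a simplified sketch with a stand-in engine and a made-up table name):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in engine for the example

df = pd.DataFrame({"name": ["a"], "value": [1.5]})

# Build a per-column dtype dict for to_sql from the mapping above
dtype = {col: dtype_mapping[str(df[col].dtype)] for col in df.columns}
df.to_sql("measurements", engine, if_exists="replace", index=False, dtype=dtype)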

Part1:

This week I installed Power BI and connected it to a remote PostgreSQL database.
I asked an AI to compare open-source data sources for Power BI by:
- Ease of setup on Linux: SQLite > PostgreSQL > MySQL > Redis > MongoDB
- Performance:
  + For large datasets: MongoDB > PostgreSQL > MySQL > Redis > SQLite
  + For real-time operations: Redis > MongoDB > MySQL > PostgreSQL > SQLite

For PostgreSQL I prepare the data in a Python script (sketched below) that uses:
- pandas - for converting types to datetime and numeric
- sqlalchemy - for simplifying type conversion
- asyncpg - the SQLAlchemy backend for connecting to PostgreSQL
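A simplified sketch of that script, assuming made-up credentials, table, and column names:

import asyncio

import pandas as pd
from sqlalchemy.ext.asyncio import create_async_engine

# asyncpg as the SQLAlchemy backend (credentials are placeholders)
engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/dbname")

async def load(df: pd.DataFrame) -> None:
    # pandas: convert types to datetime and numeric
    df["ts"] = pd.to_datetime(df["ts"])
    df["value"] = pd.to_numeric(df["value"], errors="coerce")

    # to_sql is synchronous, so run it on the async connection via run_sync
    async with engine.begin() as conn:
        await conn.run_sync(
            lambda sync_conn: df.to_sql("metrics", sync_conn, if_exists="replace", index=False)
        )

asyncio.run(load(pd.DataFrame({"ts": ["2025-05-01"], "value": ["1.5"]})))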

Talend, probably the only mature open-source Extract-Transform-Load (ETL) tool for working with data, is no longer maintained and has been retired :ablobcatcry:

Apparently, a year ago Qlik, which owns Talend, said the open-source version of Talend Studio "does not contribute to Qlik's commercial products".

It's so sad, because DBT, which some dare to call an ETL tool (in fact it is more of a templating engine), is far from the functionality an ETL tool is supposed to offer :blobcatthink:

One of the most highlighted parts: "There is no need to move data. Data latency is minimised. Data can be transformed and analysed within a single platform."

This is one of the reasons for 'Why ETL-Zero' :blobcoffee:

towardsdatascience.com/why-etl

Towards Data Science · Why ETL-Zero? Understanding the Shift in Data Integration, by Sarah Lea

In a data warehouse you store structured & organized data. In a data lake you can additionally store unstructured data. And what, then, is a data lakehouse?

Think of a combination of the strengths of both previous data platforms. :blobcoffee:

towardsdatascience.com/sql-and

Towards Data Science · SQL and Data Modelling in Action: A Deep Dive into Data Lakehouses, by Sarah Lea