**Coach Pāṇini ®** @paninid@mastodon.world · Mar 26, 2024

@argv_minus_one @russellmcormond @Uair @hosford42 @actuallyautistic @neurodivergence

Eliminating fashy supremacist worldviews from “AI” MIGHT involve such deep curation of the #TrainingData set as to make the entire effort economically unviable.

SHOT: https://www.superversive.co/blog/crystallized-social-relations

CHASER: https://knowingmachines.org/models-all-the-way

**Russell McOrmond** @russellmcormond@spore.social · Mar 26, 2024

@paninid @argv_minus_one @Uair @hosford42 @actuallyautistic @neurodivergence

It depends on what you are trying to use the LLM for.

I'm not sure what the purpose is of injecting random data that is curated only by language. Even for "English", which English are you curating? Which population?

Even the notion that the worldviews of all English-speaking populations are "universal" is false, and thus an uncurated data set is GIGO (Garbage In, Garbage Out). It's an inappropriate use of the tech.

**Coach Pāṇini ®** @paninid@mastodon.world · Mar 26, 2024 *

@russellmcormond @argv_minus_one @Uair @hosford42

I have a feeling that uncurated #TrainingData sets are common, and now “user-generated data” is like pre-war steel in Geiger counters.

The well has already been poisoned, and we’re all just waiting to feel symptoms.

https://www.superversive.co/blog/synthetic-chernobyl

Aaron @hosford42@techhub.social

@paninid @russellmcormond @argv_minus_one @Uair

> I have a feeling that uncurated #TrainingData sets are common

I work in this field and can confirm that this is the case. I have rarely seen discussion on the matter in the workplace, either.

Mar 27, 2024, 03:33 PM · Tusky

**Aaron** @hosford42 · Mar 27, 2024

@paninid @russellmcormond @argv_minus_one @Uair Most discussion centers around whether the corporate project can reach its business goals using the data set. Questions focus on whether the data has the needed features and labels (the structure of the data), whether the data is comprehensive enough, whether the sample distribution is similar to the intended use, whether class imbalances (too much of one label and not enough of another) exist, and how clean or dirty the data is w.r.t. noise or mislabeling, but not the potential biases of that data. Data quality is generally assessed by some engineer looking at a few samples and forming a personal opinion. The vast majority of projects I have been involved in were "fly by the seat of your pants", and it was a struggle to get people to pay attention even to the majority of the questions I just listed.
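Two of the checks listed above, class imbalance and whether the sample distribution resembles the intended use, can be measured rather than eyeballed. The following is only a rough sketch: the function names, the toy labels, and the choice of total variation distance as the comparison metric are illustrative assumptions, not anything described in the thread.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the data set as a fraction of 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def imbalance_ratio(labels):
    """Most common label count divided by least common (1.0 means balanced).

    Assumes `labels` is non-empty.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def total_variation(shares_a, shares_b):
    """Total variation distance between two label distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    keys = set(shares_a) | set(shares_b)
    return 0.5 * sum(abs(shares_a.get(k, 0.0) - shares_b.get(k, 0.0)) for k in keys)

# Hypothetical toy data: a skewed training set vs. balanced real-world usage.
train_labels = ["spam"] * 90 + ["ham"] * 10
usage_labels = ["spam"] * 50 + ["ham"] * 50

ratio = imbalance_ratio(train_labels)
shift = total_variation(label_shares(train_labels), label_shares(usage_labels))
```

Numbers like these only cover label structure. They say nothing about the biases in the content itself, which is the gap the post describes.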

**Aaron** @hosford42 · Mar 27, 2024

@paninid @russellmcormond @argv_minus_one @Uair The field of data science is in its infancy. People are still trying to figure out the right questions to ask about a data set. Many of the concepts are only intuitively defined, and the language is rapidly evolving. We are still in the wild west phase, where individual efforts are the primary source of order and justice, and coverage is spotty at best. Small shops spring up throughout the corporate sphere, slapping tools together haphazardly to get to some ill-defined end goal with little understanding. The bigger ML-focused groups are better, but still not good. On one of the largest teams I worked with, I struggled just to get them to see that overlap between their training and test data was an issue, much less to address it. Data was an afterthought.
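Overlap between training and test data is one of the cheaper problems to detect. Here is a minimal sketch, assuming text samples and treating two samples as duplicates when they match after lowercasing and whitespace collapsing. The helper names and sample strings are hypothetical, and near-duplicates (paraphrases, small edits) would need a fuzzier method than exact hashing.

```python
import hashlib

def fingerprint(text):
    """Hash a sample after light normalization (lowercase, collapsed whitespace)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_overlap(train, test):
    """Return the test samples whose fingerprint also appears in the training set."""
    train_prints = {fingerprint(sample) for sample in train}
    return [sample for sample in test if fingerprint(sample) in train_prints]

# Hypothetical toy data: the first test sample is a training sample in disguise.
train = ["The cat sat.", "Dogs bark loudly."]
test = ["the  cat sat.", "Birds sing."]

leaked = find_overlap(train, test)
```

Any sample in `leaked` inflates the test score, because the model is being graded on something it has already seen.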
