techhub.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
A hub primarily for passionate technologists, but everyone is welcome

@paninid @argv_minus_one @Uair @hosford42 @actuallyautistic @neurodivergence

It depends on what you are trying to use the LLM for.

I'm not sure what the purpose is of injecting random data that is curated only by language. Even for "English", which English are you curating? What population?

Even the notion that the worldviews of all English-speaking populations are "universal" is false, and thus an uncurated data set is GIGO (Garbage In, Garbage Out). It's an inappropriate use of the tech.

Aaron

@paninid @russellmcormond @argv_minus_one @Uair

> I have a feeling that uncurated sets are common

I work in this field and can confirm that this is the case. I have rarely seen discussion on the matter in the workplace, either.

@paninid @russellmcormond @argv_minus_one @Uair Most discussion centers around whether the corporate project can reach its business goals using the data set. Questions focus on whether the data has the needed features and labels (the structure of the data), whether the data is comprehensive enough, whether the sample distribution is similar to the intended use, whether class imbalances (too much of one label and not enough of another) exist, and how clean or dirty the data is w.r.t. noise or mislabeling, but not the potential biases of that data. Data quality is generally assessed by some engineer looking at a few samples and forming a personal opinion. The vast majority of projects I have been involved in were "fly by the seat of your pants", and it was a struggle to get people to pay attention even to the majority of the questions I just listed.
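As a rough illustration of the class-imbalance question above, here is a minimal Python sketch. The function names (`label_shares`, `imbalance_ratio`) and the toy `spam`/`ham` labels are made up for this example and don't come from any particular project. Note that a check like this says nothing about bias in the data, which is exactly the gap described above.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's fraction of the data set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(labels):
    """Return the count of the most common label divided by the rarest."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


# Toy data: 5 "spam" examples against 95 "ham" examples.
labels = ["spam"] * 5 + ["ham"] * 95
shares = label_shares(labels)
ratio = imbalance_ratio(labels)
print(shares, ratio)
```

A high ratio is only a prompt to look closer. Whether it matters depends on the task and on how the model will be evaluated.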

@paninid @russellmcormond @argv_minus_one @Uair The field of data science is in its infancy. People are still trying to figure out the right questions to ask about a data set. Many of the concepts are only intuitively defined, and the language is rapidly evolving. We are still in the wild west phase, where individual efforts are the primary source of order and justice, and coverage is spotty at best. Small shops spring up throughout the corporate sphere, slapping tools together haphazardly to get to some ill-defined end goal with little understanding. The bigger ML-focused groups are better, but still not good. On one of the largest teams I worked with, I struggled just to get them to see that overlap between their training and test data was an issue, much less to address it. Data was an afterthought.
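To make the train/test overlap problem concrete, here is a minimal Python sketch, assuming text samples where case and whitespace differences should not count as distinct. The helper names (`normalize`, `find_overlap`) and the sample sentences are hypothetical. Real leakage checks often need fuzzier matching, such as near-duplicate detection, than this exact comparison.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_overlap(train, test):
    """Return the test samples whose normalized form also appears in train."""
    train_keys = {normalize(sample) for sample in train}
    return [sample for sample in test if normalize(sample) in train_keys]


# Toy data: the first test sentence is a near-copy of a training sentence.
train = ["The cat sat.", "Dogs bark loudly."]
test = ["the  cat sat.", "Birds sing."]
leaked = find_overlap(train, test)
print(leaked)
```

Any sample returned here was effectively seen during training, so scores measured on it overstate how well the model generalizes.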