Hey Fediverse - does anyone know how to work in Python (PySpark) or Scala with files that do not have a file extension?
I am working with a large number of tab-delimited text files produced by a 3rd party that do not have any file extension.
For example, a file that would logically be called "customerdata.tsv" is instead called simply "customerdata".
In my notebook this works, but only if I manually rename the source file:
df = spark.read.csv("customerdata.tsv", sep=r'\t')
This does not work:
df = spark.read.csv("customerdata", sep=r'\t')
I'm hoping to avoid needing to rename all ~200 source files just to get this to work. My searching so far hasn't turned up anything useful - can anyone here point me in the right direction?
Thanks in advance!
Edit: This ended up being user error on my part - I was trying to read from the wrong folder where the "no extension" file didn't actually exist. Thanks to all who replied!
I might be missing something in your issue, but why not just write a short script to rename the files in bulk?
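Something along these lines would probably do it - a minimal sketch, assuming they all sit in one folder (the folder path here is just a placeholder):

from pathlib import Path

# tack a .tsv extension onto every extensionless file in the folder
for p in Path("./source_files").iterdir():
    if p.is_file() and p.suffix == "":
        p.rename(p.with_suffix(".tsv"))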
@pyranose @SQLAllFather If it really does matter, I've used random scripts or third-party apps in the past to mass rename stuff. I'd agree: tack some extensions on them and move on. If you wanted to dupe the folder they live in first, just to be safe, you could do that too.
@SQLAllFather My guess is that spark is mistaking it for a directory. If all your files have the same columns, maybe just try reading the parent directory?
But I agree with the reply above: just writing a script to rename the files seems to be the most straightforward solution.
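Roughly like this, with the folder path just being a placeholder:

# point Spark at the folder itself; it reads every file inside, whatever they're named
df = spark.read.csv("path/to/source_folder/", sep=r'\t')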
This must be specific to spark. Python's stdlib csv module doesn't care what the filename is, or even if it has one.
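For comparison, a minimal stdlib sketch (reusing your "customerdata" filename):

import csv

# the csv module only needs a readable file object; the name and extension are irrelevant
with open("customerdata", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)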
Perhaps there's an additional param to that spark function that lets you override detection-by-file-extension? Haven't used spark myself.
Otherwise, a shell one-liner will do it:
for file in ./somedir/* ; do mv "$file" "$file.tsv" ; done
@SQLAllFather does it work if you use pathlib to explicitly make it a path first i.e.
from pathlib import Path
# spark.read.csv expects a string, so resolve the Path and pass it back as str
df = spark.read.csv(str(Path("customerdata").resolve()), sep=r'\t')
Thank you!
While experimenting with this approach I discovered that the root error was on my end, and that I'd entered an invalid path - I was trying to work with a file that didn't exist. I guess this will teach me to sleep on errors before reaching out for help.
@SQLAllFather oh yes that would also do it :)
Glad to be of service!