Hey Fediverse - does anyone know how to work in Python (PySpark) or Scala with files that do not have a file extension?
I am working with a large number of tab-delimited text files produced by a 3rd party that do not have any file extension.
For example, a file that would logically be called "customerdata.tsv" is instead called simply "customerdata".
In my notebook this works, but only if I manually rename the source file:
df = spark.read.csv("customerdata.tsv", sep=r'\t')
This does not work:
df = spark.read.csv("customerdata", sep=r'\t')
I'm hoping to avoid needing to rename all ~200 source files just to get this to work. My searching so far hasn't turned up anything useful - can anyone here point me in the right direction?
Thanks in advance!
Edit: This ended up being user error on my part - I was trying to read from the wrong folder where the "no extension" file didn't actually exist. Thanks to all who replied!
I might be missing something in your issue, but why not just write a short script to rename the files in bulk?
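Something along these lines would probably do it - a minimal sketch, assuming they all sit in one folder (the folder path here is just a placeholder):

from pathlib import Path

# tack a .tsv extension onto every extensionless file in the folder
for p in Path("./source_files").iterdir():
    if p.is_file() and p.suffix == "":
        p.rename(p.with_suffix(".tsv"))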
@pyranose @SQLAllFather If it really does matter, I've used random scripts or third-party apps in the past to mass rename stuff. I'd agree: tack some extensions on them and move on. If you wanted to dupe the folder they live in first, just to be safe, you could do that too.
@SQLAllFather My guess is that spark is mistaking it for a directory. If all your files have the same columns, maybe just try reading the parent directory?
But I agree with the reply above: just writing a script to rename the files seems to be the most straightforward solution.
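Roughly like this, with the folder path just being a placeholder:

# point Spark at the folder itself; it reads every file inside, whatever they're named
df = spark.read.csv("path/to/source_folder/", sep=r'\t')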
This must be specific to spark. Python's stdlib csv module doesn't care what the filename is, or even if it has one.
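For comparison, a minimal stdlib sketch (reusing your "customerdata" filename):

import csv

# the csv module only needs a readable file object; the name and extension are irrelevant
with open("customerdata", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)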
Perhaps there's an additional param to that spark function that lets you override detection-by-file-extension? Haven't used spark myself.
Otherwise, a shell one-liner will do it:
for file in ./somedir/* ; do mv "$file" "$file.tsv" ; done
@SQLAllFather does it work if you use pathlib to explicitly make it a path first i.e.
from pathlib import Path
# spark.read.csv expects a string, so resolve the Path and pass it back as str
df = spark.read.csv(str(Path("customerdata").resolve()), sep=r'\t')
Thank you!
While experimenting with this approach I discovered that the root error was on my end, and that I'd entered an invalid path - I was trying to work with a file that didn't exist. I guess this will teach me to sleep on errors before reaching out for help.
@SQLAllFather oh yes that would also do it :)
Glad to be of service!