Another Synology Drive data loss bug
Notwithstanding their recent boneheaded announcement (reported on by Ars Technica) about restricting which drives can be used in their NASes, #Synology gets most things right, but every once in a while their apps just… lose data, and it’s not clear that they care.
I’ve written before about a Synology Drive Client bug on Linux they’ve known about for years and haven’t bothered to fix. And then there’s the time one of their NASes had a gradually manifesting hardware bug. They could have notified customers and proactively done a recall, but instead they just let customers’ NASes fail, at which point those customers were forced to shell out money for a new one.
Today I’m here to tell you about another data-loss bug in Synology Drive, and the workaround I’ve been forced to implement to avoid having it bite me (again).
Simply put, sometimes Synology Drive Client stops pulling files down from the server. When this happens, it claims that everything is fine and that it’s synchronizing successfully, and it will happily upload to the server any files you modify locally. However, files modified on other computers and synchronized by them to the server don’t get pulled down to the computer that is in this broken state.
Let me say this again: it claims everything is working properly but it isn’t. That’s generally considered Really Bad.
You can get the client to start synchronizing again by restarting the client, but (a) it’s not clear to me that files which weren’t synchronized in the interim get synchronized when you restart, and (b) there are various data-loss and data-conflict scenarios which occur when you modify files on multiple computers when one or more of them aren’t synchronizing properly.
I don’t know the root cause of this, so I don’t know of any way to prevent the problem from happening. Instead, I am now running a script every minute on all of my computers that sends and receives “pings” to/from the other computers in the group via temporary directories and files created within my Synology Drive directory. The script emails me when it doesn’t receive a “response” to a ping it sent to one of the other computers in the group. This means I’ll get some spurious emails when one of my computers is asleep or not on the network, but these are a small price to pay compared to the price of losing data because Synology Drive is failing again.
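To make the idea concrete before the full script, here is a minimal sketch of the handshake on its own. It is an illustration under assumptions, not the script I actually run: the function names (`send_ping`, `answer_pings`, `ping_status`) and the host names `alpha` and `beta` are made up for the example, and a throwaway temporary directory stands in for the synchronized folder so that both "ends" can run on one machine.

```shell
#!/bin/bash
# Sketch of the syn/ack handshake over a shared directory.
# PINGDIR is a temp dir standing in for the synced folder (assumption).
set -e
PINGDIR=$(mktemp -d)
trap 'rm -rf "$PINGDIR"' EXIT

# Sender side: create ping.<from>-<to>.<timestamp>/syn and print the directory.
send_ping() {
    local from="$1" to="$2"
    local dir="$PINGDIR/ping.$from-$to.$(date +%s)"
    mkdir -p "$dir"
    date > "$dir/syn"
    echo "$dir"
}

# Receiver side: acknowledge every unanswered ping addressed to host $1.
answer_pings() {
    local me="$1" syn dir
    for syn in "$PINGDIR"/ping.*-"$me".*/syn; do
        [ -f "$syn" ] || continue          # glob matched nothing
        dir=$(dirname "$syn")
        [ -f "$dir/ack" ] || date > "$dir/ack"
    done
}

# Sender side: report whether a given ping has been answered yet.
ping_status() {
    if [ -f "$1/ack" ]; then echo answered; else echo pending; fi
}

dir=$(send_ping alpha beta)      # alpha pings beta
before=$(ping_status "$dir")     # no ack yet
answer_pings beta                # beta sees the syn and acknowledges it
after=$(ping_status "$dir")
echo "before=$before after=$after"   # prints "before=pending after=answered"
```

In the real setup the two halves run on different machines, so the `syn` file only reaches the receiver if the receiver is pulling from the server, and the `ack` file only reaches the sender if the sender is pulling too. A ping that stays unanswered for too long therefore means at least one side has stopped syncing (or is simply offline).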
I haven’t reported this issue to Synology because it’s intermittent and I have no idea how to reproduce it, so I’m certain they’ll blow me off.
Here’s the script, for those of you who are curious.
#!/bin/bash

set -e
shopt -s nullglob

PINGDIR=~jik/CloudStation/tmp/syno-pings
ME=$(hostname --short)
DEBUG=false
INTERVAL=60

while [ -n "$1" ]; do
    case "$1" in
        -d|--debug) DEBUG=true; shift ;;
        -i|--interval) shift; INTERVAL="$1"; shift ;;
        -*) echo "Unrecognized option: $1" 1>&2; exit 1 ;;
        *) break ;;
    esac
done

if [ -z "$1" ]; then
    echo "No remote host(s) specified" 1>&2
    exit 1
fi

debug() {
    if ! $DEBUG; then
        return
    fi
    echo "$@"
}

file_age() {
    local path="$1"; shift
    now=$(date +%s)
    if then=$(stat -c %Y "$path" 2>/dev/null); then
        echo $((now-then))
    else
        echo missing
    fi
}

wait_for() {
    local delay="$1"; shift
    local path="$1"; shift
    age=$(file_age "$path")
    if [ $age = missing ]; then
        echo missing
    elif ((age < delay)); then
        echo waiting
    else
        echo finished
    fi
}

settling() {
    local path="$1"; shift
    case $(wait_for $((INTERVAL/2)) "$path") in
        missing) echo missing ;;
        waiting) echo yes ;;
        finished) echo no ;;
    esac
}

late() {
    local path="$1"; shift
    case $(wait_for $((INTERVAL*2)) "$path") in
        missing) echo missing ;;
        waiting) echo no ;;
        finished) echo yes ;;
    esac
}

dohost() {
    local them="$1"; shift
    debug Working on pings from $ME to $them

    # Note if we were previously broken.
    set -- $PINGDIR/ping.$ME-$them.*/broken
    if [ -n "$1" ]; then
        was_broken=true
    else
        was_broken=false
    fi
    debug was_broken=$was_broken

    # Clear any pings that have been answered
    for ping in $PINGDIR/ping.$ME-$them.*/ack; do
        dir=$(dirname $ping)
        if [ $(settling $dir) = yes ]; then
            debug Ignoring recently acknowledged ping $dir
            continue
        fi
        debug Clearing acknowledged ping $dir
        rm -rf $dir
    done

    # Check for old pings that have not been answered yet.
    is_broken=false
    for ping in $PINGDIR/ping.$ME-$them.*/syn; do
        dir=$(dirname $ping)
        if [ -f $dir/broken ]; then
            debug $dir remains broken
            continue
        fi
        if [ $(late $dir) = no ]; then
            debug Ignoring recently generated ping $dir
            continue
        fi
        is_broken=true
        echo $(date) > $dir/broken
        debug $dir is newly broken
    done

    if $was_broken && ! $is_broken; then
        echo Pings from $ME to $them have recovered
    elif ! $was_broken && $is_broken; then
        echo Pings from $ME to $them are failing, one of us is not syncing 1>&2
    fi

    # Create a new ping.
    newpingdir=$PINGDIR/ping.$ME-$them.$(date +%s)
    mkdir $newpingdir
    echo $(date) > $newpingdir/syn
    debug Created $newpingdir/syn
}

# Respond to pings sent to me.
for ping in $PINGDIR/ping.*-$ME.*/syn; do
    dir=$(dirname $ping)
    if [ -f $dir/ack ]; then
        debug Ignoring already acknowledged ping $dir
        continue
    fi
    result=$(settling $dir)
    if [ $result = missing ]; then
        # Other end deleted it
        debug Ignoring $dir after it disappeared
        continue
    elif [ $result = yes ]; then
        debug Ignoring recently received ping $dir
        continue
    fi
    debug Responding to $dir
    echo $(date) > $dir/ack
done

for them; do
    case "$them" in
        *\ *) echo no spaces allowed in host names 1>&2; exit 1 ;;
    esac
    dohost $them
done
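
Note that the script itself only prints its "recovered" and "failing" messages; it doesn’t send mail on its own. One way to get the every-minute schedule and the emails is cron, which mails whatever a job prints to the address in `MAILTO`. The entry below is a sketch assuming cron is the scheduler. The install path, the email address, and the host names are placeholders, not the real values.

```crontab
# Hypothetical crontab entry: path, address, and host names are placeholders.
# Each computer lists the *other* computers in the group as arguments.
MAILTO=me@example.com
* * * * * /usr/local/bin/syno-ping laptop desktop
```

Because the script stays silent unless a ping changes state (or `--debug` is given), a setup like this sends mail only when pings start failing or recover.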