DreamObjects data (maybe) corrupted, support can't tell me more

dreamobjects

#1

TL;DR: DreamObjects migration was failing; I was told some files had been corrupted upon upload; it’s been 3 weeks with no resolution or details.

Hi there.

I think I’m generally a patient person, but I am fed up with the (lack of) response of Dreamhost’s Support.

Due to the change in the data centers I was told to migrate my DreamObjects data. Why this was not automatic I cannot guess. But ok, I started migrating my data. I had four users, each with one bucket. How hard could it be?

Of these four migrations, two got stuck.

I opened a support ticket on the 5th of September. I heard nothing and wrote back every few days asking for updates. Finally on the 18th, they identified three files as the source of the trouble:

Our Cloud Engineers believe that those files are corrupted, most likely
from when they were first uploaded.

And they assured me:

You are safe to start using the new DreamObjects endpoint
(objects-us-east-1.dream.io) in your applications, in spite of that
issue.

I wasn’t so confident about that. I explained that I needed more information. What does “corrupted” mean, and what exactly went wrong with the process? Do I need to submit a bug report to the uploading application? And how can a corrupted file be sitting on their system undetected for months? Why didn’t it get flagged upon upload?

In short, what are we doing to prevent this from happening again?

I sent them my checksums for those files (yes they’re big files, multi-part uploads) and this was their reply:

Hmm. Curious. Those md5sums don’t match the md5sums currently in the
US-West 1 cluster version of the bucket, but they do match what we get
when we download from the US-West 1 cluster to a Linux server. Since
these objects are greater than 5 GB, a multi-part upload would have been
required, so maybe the md5sums aren’t going to match and we can just
force the migration anyways, but we’re still working with our
DreamObjects engineers to verify that fact.

I also verified that I could download these files (from the old cluster) and they were not corrupted. But still no answer to my questions. I repeated them. (Along with a few other questions, e.g., “What do you mean by md5sums in the case of multi-part uploads?”)
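For anyone who wants to repeat that download check, here is a minimal sketch of how it can be done in Python. It streams each file through MD5 in chunks, so multi-gigabyte files never have to fit in memory. The function names `file_md5` and `same_content` and the 8 MiB chunk size are my own choices for illustration, not anything Dreamhost supplied.

```python
import hashlib


def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original_path, downloaded_path):
    """True when both files hash to the same MD5 (hypothetical helper)."""
    return file_md5(original_path) == file_md5(downloaded_path)
```

This only compares the local original with the downloaded copy. It says nothing about whatever checksum the storage cluster reports for the object.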

Yesterday (almost 3 weeks since the start of this), I got this reply [edited for brevity]:

Unfortunately at this moment I don’t have details on why our “Cloud
Engineers believe that those files are corrupted”, …
I have to check with the support representative who
worked with the engineers on this particular case.

I’ve forwarded this newest support ticket / thread to him now, but as I
understand it he’s out of the office for a few days, so it will take a
little bit to get a response.

At this point I’m out of patience and don’t know what to do to get this escalated. And really, the support person is out of the office? How am I supposed to react to that? (“Oh, ok, I’ll wait. Hope he’s having a great time.”) How many other tickets are on hold?

Is Dreamhost serious about data storage or is it just a side project? Ok, I admit I’m a small user but what if my business were depending on this data every single day? How do they expect to keep customers when they can’t quickly fix a problem—or even give me enough info to feel confident to continue using their storage?

Exactly what is the point of using a cloud storage provider if they can’t give me immediate support? Why not just buy a NAS?

Thanks for listening!


#2

Hi there!

We apologize for any frustration! If you can supply us with a support ticket # or the domain name on the account, we can look into the status and get it escalated to a manager.

Thanks!
Matt C


#3

There are now several ticket numbers:

8343861
8346583
8348612

[It appears that new ticket numbers are created every time I reply to an email. I would suggest that’s not the best system.]

Thanks.


#4

Thank you for those ticket #'s. I was able to get them over to a support manager to look into, and they were also sent to our Cloud team to investigate further. They will update you by email accordingly.

Thanks!
Matt C


#5

Got a reply from support (not a manager nor an engineer).

Our DreamObjects engineers have finally been able to finish their
investigation into why those three objects were not migrated. But first a
brief discussion of terms. DreamObjects is built on Ceph which utilizes a
RADOS Gateway as the in/out access point for all data. Due to a various
assortment of RADOS Gateway bugs, the initial upload of those three
objects did not actually complete successfully, but the aforementioned
bugs combined in a unique way that caused Ceph to incorrectly catalog the
objects in the bucket index. Unfortunately due to those bugs, we are not
able to recover those objects.

Needless to say, this does not inspire confidence in the system. :wink:


#7

I have a similar failing migration issue with one of my websites. While migrating, it says files are corrupted, and I could not find a way to migrate it properly.

Source: www.thetechnologypost.com


#8

Do you know exactly which files are corrupted? And if so, are they particularly large (e.g., > 5 GB)?

Unfortunately, the problem hasn’t gone away with the new cluster. I re-uploaded my failed files to the new cluster and asked the support team to verify them. Their reply:

I’m getting inconsistent results when verifying those new uploads.

Again, I wish the support team wouldn’t speak to me in vague terms. What on earth does “inconsistent results” mean? I very much hope their servers are not giving inconsistent results.


#9

Update, this just in from Dreamhost Support:

Unfortunately, at this exact moment I’m not sure how we can
definitively verify that those uploaded files are not corrupt. I mean, it
should be noted that that message from last month was from a time when it
was not clearly understood by myself that multipart objects literally
could not have a matching md5sum due to the way in which the objects were
being assembled/uploaded. But that still doesn’t answer your question.

And with my limited knowledge only being slightly increased to understand
the whole md5 multipart ETag situation, the best I can do for you right
now is try a Google search for “verify s3 upload” and forward your
question to our DreamObjects engineers for review. Both of which I’ve
done now (although the Google search is preliminarily only showing me
articles and forum discussions about calculating the md5sum before
uploading and attaching it to the upload as metadata).

Once again, I’m saddened to see that this kind of reply is acceptable customer support (and after nearly seven weeks of this issue being open). The customer service rep is admitting that it’s over his head (good for him), but why must I still have him as the gatekeeper of all this info? Any suggestions as to how I can get this escalated?
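For readers puzzled by the "md5 multipart ETag situation", here is a minimal sketch of the convention commonly used by S3-compatible services: the ETag of a multipart object is the MD5 of the concatenated binary MD5 digests of the parts, followed by a dash and the part count. I am assuming DreamObjects follows that convention; support never confirmed it. The function name `multipart_etag` is hypothetical, and `part_size` must be the same part size the uploading client actually used, which you have to look up in that client's settings.

```python
import hashlib
import io


def multipart_etag(stream, part_size):
    """Compute an S3-style multipart ETag for a binary stream.

    Assumption: the service hashes each part with MD5, then hashes the
    concatenation of those raw digests and appends "-<number of parts>".
    """
    part_digests = []
    while True:
        chunk = stream.read(part_size)
        if not chunk:
            break
        part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


# Tiny demonstration with 10 bytes split into 4-byte parts (3 parts).
payload = b"a" * 10
etag = multipart_etag(io.BytesIO(payload), part_size=4)
plain_md5 = hashlib.md5(payload).hexdigest()
print(etag)
print(plain_md5)
```

The two printed values differ, which is why a plain md5sum of a large file cannot be expected to equal the ETag the cluster reports. If the locally computed multipart value matches the reported ETag, that is reasonable evidence the stored parts are the ones that were sent, but only under the assumptions above.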


#10

In the last email update, they did ask for some additional info that might help them verify the data. You can view the support history here https://panel.dreamhost.com/index.cgi?tree=support.msg& and also reply to the last email they sent or create a new ticket via that link.

Thanks!
Matt C


#11

Yes, HD_Matt_C, they did indeed ask for additional info. Here’s the relevant part:

Note: Oh, I did just think of a third option for verifying the data, but
it requires more information from you about what these large files are
and what sort of application we should be using to look at them. Would
you be willing / able to provide us with that information?

I don’t mean to sound sour, but once again this doesn’t inspire confidence. To be blunt, this suggestion is absurd.

Why on earth would a cloud storage provider need to know “more information” about a file in order to guarantee data integrity? They didn’t need any information at all to determine that the objects were “corrupt” (their term) when this problem first started. Suppose I told them: “This file is an encrypted database for a proprietary program which optimizes 3D models for efficient delivery over cellular networks.” How is that going to help?

And this kind of analysis can’t scale. Now that I’ve had 3 bad objects, how exactly do we verify all the others? Am I to provide them with a spreadsheet telling “what each one is and what application they can use to look at them”?

[OMG, I’m actually replying to this suggestion as if it were rational.]

No, I will not go on a wild-goose chase to verify if my data—which I pay them to store—is corrupt. That is their job. If I wanted to debug this myself, I’d buy myself a NAS and run minio on it and accept all the headaches associated with running one’s own S3 service.

I simply cannot believe that Dreamhost is serious about being an S3 provider if they cannot guarantee—or even easily verify—the integrity of the data they are storing.

And, once again, my support experience on this issue has been crazy slow and taken far too much of my time.

I may sound angry, but I’m actually sad. I’ve been a Dreamhost customer since 2010. Back then it was so easy to get an engineer on the line… Sad to see what was such a good company go so far downhill.


#12

Ok, this is probably my final update on this topic. Just in case anybody is still reading. :wink:

I wrote back to Dreamhost Support and told them I wouldn’t spend any more time on the problem, reiterating that I pay them for this service. They replied that they had done all they could, which still makes no sense to me.

Further, it became clear that during their attempts to diagnose this problem, they had downloaded my files from their servers. That’s fine. The odd part is that I was charged for these downloads! Fortunately I caught it and they said they’d refund me that charge.

So, let’s summarize, shall we?

  1. Dreamhost migration failed. I filed a ticket, but got no reply.
  2. After 2 weeks and repeated emails to them, they found that my files were corrupted.
  3. A week later, they were able to tell me details about that corruption.
  4. I re-uploaded the data and asked Dreamhost Support to verify that the problem was fixed for this new data.
  5. They performed “tests” which were unrelated to the data corruption and which my account was billed for.
  6. Eventually, they admitted that they had no way to verify that the data corruption problem was in fact fixed.

Total elapsed time, 2 months.


#13

Thank you, barkofdelight. Your tale confirms my belief that support has a few clunkers, and it seems a number of support messages just get lost. I’ve had my own hiccups with the migration to the east. Looking at the usage tab, I guess it shows usage on east, since it shows all zeros for Jun-Jul and 1 TB for Aug-Sept and on up to now. I’ll need a few more weeks of access to the files on west to be sure everything is correct on east. I was surprised that the ACL was changed in the move. Except for not getting any help from customer support, I’d say the move was only as painful as I expected. Of course it was a total waste of my time, with no improvement in speed or cost, and reliability is still to be proven.
-mort


#14

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.