From some of the comments on my previous post, I think I haven’t done a very good job explaining what happened and the nature of how this has affected FeedLounge. I’ll try to do this here, in the form of an O’Grady-style Q & A.
Are you telling me your server literally burned up?
Yes, the fan that was supposed to keep the CPU cool stopped working. The CPU overheated and burned a little CPU-sized crater in the motherboard.
So FeedLounge was running on this server?
No, neither FeedLounge nor the feedlounge.com web site was running on the server that burned up.
Umm, then why can’t I get to the ‘Lounge or the feedlounge.com web site?
The dead server was the primary DNS server for feedlounge.com – it was the magic beans that told the rest of the internet where to go when you typed ‘feedlounge.com’ into your browser.
Wow, this box was responsible for making feedlounge.com accessible to the rest of the internet and you only had one of these?
Actually, we had three of them. We had the box that burned, and two other separate boxes hosted by Austin Web Development.
I don’t get it. If there was more than one box, then why didn’t the backup boxes kick in?
About a week ago, Austin Web Development re-configured their DNS servers. They gave us warning beforehand and we checked through all the sites on the now-crispy server to make sure we weren’t relying on the Austin DNS for any of them.
Why didn’t you also check this for feedlounge.com?
Unfortunately, the answer is really simple – we forgot. We moved the feedlounge.com web site to a dedicated server in a data center in New Jersey last summer, around the same time we moved the ‘Lounge onto our big servers in our rack space in San Francisco.
Since feedlounge.com was no longer on boxes at the Austin data center, we didn’t think to check the DNS records for feedlounge.com.
Do you now feel that was monumentally stupid?
Um, yeah. And then some.
Are you saying that if this had happened a week ago, it wouldn’t have caused any trouble?
Most likely, yes. We’d have replaced the fried box just like we’ve done, but the backup DNS servers would have shouldered the load while we did so.
Sounds like you guys should have paid more attention to this.
Agreed. Lesson learned – the hard way.
So why is it taking so long for FeedLounge to become available again?
DNS is a bit complicated. Scott has written up a great post on this on the FeedLounge blog. The root of the problem is that a DNS change can take 24-48 hours to propagate throughout the internet.
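If you’re curious whether the new records have reached your resolver yet, here’s one quick way to check – a minimal sketch using Python’s standard library, though any lookup tool will tell you the same thing:

```python
import socket

# Ask whatever resolver your operating system is configured to use for
# feedlounge.com's current address. Until the updated records reach that
# resolver, this may still point at the old (now dead) server.
print(socket.gethostbyname("feedlounge.com"))
```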
That’s ridiculous, can’t you speed that up?
Unfortunately, no. I really wish we could.
Why can’t you put up a message saying that the server is down – at least an explanation for people?
That’s the real problem. The DNS information is what allows your ‘feedlounge.com’ request to go to the proper server. If we could get you routed to a server to see the message, we could just serve up the ‘Lounge and the feedlounge.com web site to you. The problem is getting you to the right place, not that FeedLounge is down.
How can you say that FeedLounge isn’t down? If I can’t get to it, it is down!
I can certainly understand how you could feel that way, but there is a little bit of a difference. The way the DNS…
Stop blaming this on DNS! I don’t care about the technical details – I care that I can’t get to FeedLounge.
Ok, let’s try an analogy.
Let’s say FeedLounge is a car (maybe a BMW). You drive to a nice restaurant and give the car and the keys to the valet – not thinking much of it. After dinner when you go get your keys, the valet tells you that there was an accident and unfortunately your key was broken.
“No problem” you think, “I gave my buddy my spare set of keys when I bought the car – I can just get the spare set.” Unfortunately, when you call him you find that your buddy left last week for vacation on a remote island – he’s not going to be able to help you.
Now which is more accurate in this situation?
- My car is broken.
- I can’t get into my car.
This is where we are with FeedLounge. The service isn’t down or broken, the problem is getting to it.
So are you saying this isn’t your fault?
Of course not – it’s definitely our fault. We weren’t diligent enough with our backup DNS servers and got burned[1] because of it.
It’s the same way you’d be responsible for not having the forethought to make sure your spare set of keys was available “just in case” when you valet’ed your car.
So if the service isn’t down, there has to be some way for me to get to it.
If you’re having trouble accessing the service, you can add a DNS server that has the proper information for FeedLounge to your computer’s list of DNS servers. That DNS server is “65.90.218.228”. However, we’ve gotten lots of reports from folks who are already back in the ‘Lounge, so hopefully it won’t be long now for anyone still affected.
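For the curious, here’s a rough sketch of how you could ask that name server directly and confirm it’s handing out the right address. It assumes the third-party dnspython package purely for illustration – any DNS lookup tool pointed at that IP would do the same job:

```python
import dns.resolver  # third-party "dnspython" package, used here only for illustration

# Build a resolver that skips the OS resolver list and asks only the
# temporary FeedLounge name server mentioned above.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["65.90.218.228"]

# Look up the A (address) record for feedlounge.com and print what comes back.
for record in resolver.resolve("feedlounge.com", "A"):
    print(record.address)
```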
I think it’s totally unreasonable to ask me to do this.
I wish there was another way, but this will give you access to FeedLounge now. The alternative is to wait for the DNS changes to reach your DNS servers. This is how DNS works.
So what do you do now?
Besides offering an apology to our users, there isn’t much we can do. We have to wait for the changes we’ve already made to take effect.
Can you at least promise this won’t happen again?
Unfortunately, no. In fact (as I’ve said before), I’m quite sure it will happen again at some point. What we can promise is that we’ll continue to take responsibility and to be open and honest about what is going on – even when things go wrong and being honest makes us look stupid.
As hard as we try to make sure they don’t, web services go down or are unreachable at times. This time was our fault for not having working backup DNS entries. While I think it’s highly unlikely this particular problem will surface again, the last time folks had trouble accessing FeedLounge was due to a DDOS attack on Live Journal that affected our entire data center; unfortunately that is hardly something we can promise won’t happen again.
That is totally unacceptable for a service I’m paying for!
I’m sorry you feel that way, and it’s not an uncommon response whenever there is a problem with FeedLounge or any other paid service.
We work really hard to keep the service up and running all the time, and I feel our track record in our first 4 months of service is excellent. However, the fact of the matter is that this is a service funded by Alex and Scott, and run by Alex and Scott. There is no Yahoo! or Google or VC with deep pockets to pay for mirrored data centers, or 24 hour IT staff managing the servers.
This is how bootstrapping works and we’ve tried hard to be transparent and upfront about this since the beginning. I hope that it doesn’t come as a surprise to anyone.
I’m not satisfied or happy about this.
Trust me, we’re not either.
Ok, now what?
It was a bad day for FeedLounge. With apologies to our users, we fix the problem and move forward. As we do, we’ll continue to work hard to make FeedLounge as reliable as possible and continue building the features our users are asking for.
[1] Pun intended. [back]
This post is part of the project: FeedLounge. View the project timeline for more context on this post.
This is the post I was hoping to read from you. Thanks for writing it. 🙂
well done, glad DNS is back up!
Alex, I think some of the resentment stems from the fact that in all the communications from you and Scott talking about this problem, there really wasn’t ever a straight up apology. It had the tone of “it was the server’s fault – we didn’t do it!!!”
That may seem over-the-top to you, but it’s important for users that the admins take responsibility for the issue and demonstrate that they understand this is about customer service, not obscure technical issues. I know you work really hard to make FL a great service, but sometimes I think you might believe we’re more interested in the nuts and bolts of how things work than some of us actually are. What matters is, does it work?
In the future, the best thing to do in addition to explaining the problem (which you did patiently and more than adequately) is to explain that you understand you’re still responsible for providing the service, that you failed because of A, B, and C, and that you’re sorry, and here’s how we’re going to prevent it from happening again. The “saying sorry” part seems trivial because you KNOW what’s happening and feel bad about it, probably, but it is a big part of making us customers comfortable with these two dudes out in cyberspace that we send money to. 🙂
Thanks guys.
Just a little more info on the previous post: I took 5 minutes to throw it on the new server right after the old server died – as we were frantically trying to get the new server up. When I wrote it, we didn’t yet know that FeedLounge was going to be affected at all – hence the hasty UPDATE to the post after the fact.
When I added the UPDATE, I had already posted an explanation and mea culpa on the FeedLounge blog; the problem was that people couldn’t get to it and thought that the little P.S. I added to that post was it.
Couple that with confusion about how DNS works and the most innocent things can get misunderstood – I didn’t fully realize the communication problem until late last night.
If people don’t want to wait for DNS to propagate, they should be able to get to it by typing the IP address of the feedlounge server directly (not sure what it is since DNS is down 😉) into the browser.
There really isn’t any reason you need to mess around with the whole DNS server thing unless you are doing name-based virtual hosts or something else like that.
Robert: except that doesn’t help you with your authentication cookies, now does it?
Seriously, you’re splitting hairs with the service being available thing. If I can’t get to a service, it’s down. Simple. If I can’t get into my car, then I can’t drive it. Simple. Either way feedlounge was unavailable for a long period of time.
Other than that I’ll just second Jeremy’s comments.
Wow, I think that’s one of the most ridiculous things I’ve ever read.
Oh look – I unplugged my network cable, and every web service and web site in the world was DOWN!
And I also think you guys are being absurd asking for an apology. They were clearly trying to get things working again, isn’t that the important thing? Anyone can give an apology. I’d rather have action over words any day.
Ok Bill, I’ll add something to the end of that statement: “If I can’t get to something because of something under the control of those running the service, it’s down.” Does that seem less “ridiculous”? Because it was kind of implied.
—
I can certainly understand how you could feel that way, but there is a little bit of a difference. The way the DNS…
—
Uh, no, the way DNS works isn’t relevant. If the user can’t get to it, it’s down. The only difference to an end user is semantics.
I went to your web site PatrickQG, and saw that you were 21 years old. Ahh to be young, idealistic and naive again. The world must be very simple for you right now – enjoy it.
As a thought, you could always have registered a dyndns (or similar) entry, and tell people to use it. Since they run a real short TTL, it would at least be something quick to give people that went to the right spot. Don’t know the dyndns policy on commercial use/bandwidth/query limits, but worth a shot.
You could also outsource your DNS to someone like dnsmadeeasy…
Bill: How kind of you. I don’t think there’s anything naive about being strong in my belief that there’s no point (when you’re running a service) mincing words about when it’s alive or not. Maybe the services I’ve run have had different clients – they certainly wouldn’t accept “well, it was just the dns, the application was still running… even though you couldn’t reach it”.
Let me assure you there is nothing “simple” about my view of the world.
Kevan: the only real problem with a dyndns entry is how to let your user know about it. And of course going to a third party dns provider would have still had the same delays.
Now who’s being ridiculous Bill? 😛
I do agree, saying the service was up is … semantic in my opinion. But I do think Alex has done a good job of coming around and saying “Look, we learned a crappy lesson from this and we’ll do our best to keep it from happening again”.
Expecting an apology for this isn’t absurd or out of the question.
Fair enough, Alex. Not trying to browbeat you, but as FL grows, you’re going to get less of the “this is a cool web app and I know that because I’m a developer myself” crowd and more of the “just because I haven’t pushed the power button on my dell doesn’t mean your application isn’t broken” crowd. So take it all in stride! 🙂
My 2¢: these things happen. I think you guys are doing great. Keep it up, learn from mistakes, move on, forward, upward, growing, etc.
You’ll still get my $5 a month!