Error 512: Human in the loop detected!

Do not underestimate the power of the keyboard to bring down the internet:

The real Uncle Xi behind the Chinese stock market crash!

If you ever wondered who is running/ruining the Chinese economy, take a look at the Chinese market stories from two different publications at different times:

This is a story in the Economist from July 11, 2015:

Now look at the story from the NY Times about the subsequent crash on Aug 24, 2015:

In case you do not see the photos or they disappear behind a paywall, here are the saved images from the stories.

The Economist photo with real Uncle Xi:


Here is the real Uncle Xi peeking over the newer crash:


Robert Tarjan : The art of the algorithm

I keep contending that the O(n) planarity testing algorithm from Hopcroft and Tarjan was the last original and elegant algorithm. When you boil the complexity out of any problem, it always leaves you with an elegant algorithm that can be implemented. The original planarity paper even reported run times, though run times were a tad slow then: “An ALGOL implementation of the algorithm successfully tested graphs with as many as 900 vertices in less than 12 seconds.”

HP Labs – Inventor interview – Robert Tarjan : The art of the algorithm.

“Elegant algorithms are easy to program correctly, as well as being efficient. A clever algorithm that is clean and elegant is much more likely to be used than a messy one. When people understand how an algorithm works, which is much more likely with an elegant algorithm, they are more likely to have confidence in the results it produces.

Also, elegant solutions are much easier to generalize, to extend to other problems. My goal is to find general approaches and solutions, not ad hoc tricks.”

Derailed: A rant against fad frameworks

The first time I tried Ruby on Rails to develop something non-trivial and meaningful, I was very happy to be productive in a very short time and to use the framework to code less and do more. Once I completed a few side projects in RoR, I did not really pay attention to them except for occasional maintenance. Occasionally, there was a small bug fix or a feature addition when one of those side projects was actually in production use at work.

The thing about frameworks is that you have a steeper learning curve than learning the basic language itself (learning Ruby on Rails vs. Ruby or learning the Java Class Libraries vs. Java). However, the idea (hope) is that you are more productive once you learn the framework so you code less and do more. You also understand the framework well enough that you can debug cryptic compiler errors or esoteric runtime errors when the actual description makes no sense at all (all frameworks including C++ template libraries have a way of generating Confucian error messages).

I have recently started doing a lot of heavy lifting again in C++. Writing in C and C++ is like riding a bicycle: once you figure out how to do it and how to use OOP, templates, the STL, and Boost, you remember it for the rest of your life. Of course, you might have to learn a few things every 5 years or so when the standards committee gets around to approving a few new features and compilers implement them.

Last week I tried to fix an error in a RoR app because the hosting company updated their RoR framework. I suddenly realized that my love for RoR has turned into a dislike. Not only that, I am even wary of other similar frameworks now. The actual fix was one line but nothing in the stack trace would tell me that and I was out of touch with the Rapidly Accelerating Degeneration of the framework.

I have spent a lot of my life doing C++, x86 (chips and software), Windows, and Unix. The best thing about all of them is that I can get out my x86 assembly code from my undergraduate system programming class (if I can find a floppy reader) or the C++ code from my first job last decade and expect to compile it with whatever assembler or compiler I can find now. Not only that, I can run the Unix and Windows programs I wrote over the last couple of decades and expect them to still run on the respective platforms. That even includes running a BSD sockets and console-based multi-chat, coded on a VAX (yes, we had a VAX at IIT!), on Linux now.

What is the problem with frameworks like RoR? They blatantly disregard backward compatibility. When you finish development in one version, in the next minor version you start seeing deprecated style and method warnings, and in the version after that your app stops working. And this happens on a scale of months rather than years. I am not sure if any of the other fad frameworks have the same problem. Of course, the standard answer from the perch is to rake rails:freeze. Yeah, right! That is only reasonable for a project that is done and shelved and never touched again! The reason you are using a fad framework is that you are really working on a live, evolving application. The moment you freeze your RoR and go back to add some feature to it a little later, you suddenly realize you are stuck between a rock and a hard place. Do you unfreeze and fix your app for the new framework version before you add the new feature? Do you stick to the version you have and give up that gem which exactly solves your problem but is only being developed for the new version of the framework? Also, most answers on Stack Overflow are suddenly useless to you because they all refer to the current version of the framework.

Of course, I would not use C++ to develop a web-based application. I would still have to look into Scala/Lift, Dart, Node.js, insert your favorite here. However, I would be a lot more careful about looking at the backward compatibility history/promise (if any) of the languages and frameworks du jour.

Google IRC: Group Chat Rooms on Google

We have recently been struggling to find a good group discussion forum for real-time collaboration. With members of a new team spread over multiple time zones and varying network availability, it is good to have real-time interactivity without going through email or mailing lists. IRC with some logging and searchability would be nice as long as it is reasonably secure. We are still looking into it, so let me know if you have any good answers.

While trying to figure this out, I happened onto the cool Google Chat Room hack. Google group talk by default only lets you create invitations to group chat and add more people to it. There is no subscribe mechanism to join an existing group chat unless someone in the chat invites you in. If you drop off the connection, you have to be re-invited again. What we want is a persistent chat room in Google Chat.

If you happen to be using external chat software (à la Pidgin, Adium, etc.), then there is a hack to create a pseudo room.

It turns out that there is a way to keep a persistent channel open using group chat. All Google chats are named as private-chat-UUID. From Pidgin and the like, you can join a chat with whatever UUID you pick, and the room is created if it does not exist. Once the room is created and you invite people, all the people can log off and on but can still join the same room without needing to be invited again. Voila! Channel. The nice thing is that all these chats show up in the Gmail folder and are searchable. So you just join a chat with a randomly generated UUID (using uuidgen or other web tools for random UUID generation). Note that just knowing the random UUID is not enough for outsiders to join the room. All participants have to be invited at least once by the initiator. So it remains as secure as a normal group chat.
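For those without uuidgen handy, here is a minimal Python sketch of picking a room name. It only builds the private-chat-UUID string described above; the helper name new_room_name is made up for illustration, and joining the room is still done from your chat client.

```python
import uuid


def new_room_name(prefix: str = "private-chat-") -> str:
    """Return a room name of the form private-chat-<random UUID>."""
    # uuid4() is a random UUID, comparable to what uuidgen produces.
    return prefix + str(uuid.uuid4())


# Generate one name and hand it to everybody who should join.
room = new_room_name()
print(room)
```

Each call yields a different name, so generate it once and share that one string with the people you plan to invite.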

This blog post has more details.

Now that there is an IRC/ChatRoom, what about people who are interested in seeing what the discussions were when they were offline and search for any technical information in the future? It is relatively easy to use a script running on an always on machine to capture all the google chat and email it as an archive to all users. Here is the shell of such a script. Making it send emails or create other archives is left as an exercise to the reader…

# Basic script adapted from Net::XMPP to join a chatroom and print out the log...

use strict;
use Net::XMPP;

if ($#ARGV < 6) {
    print "\nUsage: perl $0 server port username domain password chatroom resource\n\n";
    exit(1);
}

my $server = $ARGV[0];
my $port = $ARGV[1];
my $username = $ARGV[2];
my $domain = $ARGV[3];
my $password = $ARGV[4];
my $chatroom = $ARGV[5];
my $resource = $ARGV[6];
my $connectiontype = 'tcpip';
my $tls = 1;

# SIGKILL cannot be caught, so only HUP, TERM and INT get a handler.
$SIG{HUP} = \&Stop;
$SIG{TERM} = \&Stop;
$SIG{INT} = \&Stop;

my $Connection = new Net::XMPP::Client();

# Register the handlers defined at the bottom of the script.
$Connection->SetCallBacks(
    message  => \&InMessage,
    presence => \&InPresence,
    iq       => \&InIQ);

my $status = $Connection->Connect(
    hostname => $server, port => $port,
    componentname => $domain,
    connectiontype => $connectiontype, tls => $tls);

if (!(defined($status))) {
    print "ERROR: Server is down or connection was not allowed.\n";
    print " ($!)\n";
    exit(1);
}

# Change hostname
my $sid = $Connection->{SESSION}->{id};
$Connection->{STREAM}->{SIDS}->{$sid}->{hostname} = $domain;

# Authenticate
my @result = $Connection->AuthSend(
    username => $username, password => $password,
    resource => $resource);

if ($result[0] ne "ok") {
    print "ERROR: Authorization failed: $result[0] - $result[1]\n";
    exit(1);
}

print "*** Logged in to $server:$port...\n";

print "*** Getting Roster to tell server to send presence info...\n";
$Connection->RosterGet();

print "*** Sending presence to tell world that we are logged in...\n";
$Connection->PresenceSend();

# Join the room by sending presence to room/nick.
# Assumption: $chatroom is the full room JID and $resource doubles as the nickname.
$Connection->PresenceSend(to => "$chatroom/$resource");

# Process incoming stanzas until the connection drops.
while(defined($Connection->Process())) { }

print "ERROR: The connection was killed...\n";
exit(1);

# Signal handler: disconnect cleanly and quit.
sub Stop {
    print "Exiting...\n";
    $Connection->Disconnect();
    exit(0);
}

# Print each chat message as "nick: text".
sub InMessage {
    my $sid = shift;
    my $message = shift;

    my $type = $message->GetType();
    my $fromJID = $message->GetFrom("jid");

    my $from = $fromJID->GetUserID();
    my $resource = $fromJID->GetResource();
    my $subject = $message->GetSubject();
    my $body = $message->GetBody();
    print "$resource: $body\n";
}

# Dump a short summary of each IQ stanza.
sub InIQ {
    my $sid = shift;
    my $iq = shift;

    my $from = $iq->GetFrom();
    my $type = $iq->GetType();
    my $query = $iq->GetQuery();
    # Some IQ stanzas carry no query child.
    my $xmlns = defined($query) ? $query->GetXMLNS() : "";
    print "===\n";
    print "IQ\n";
    print " From $from\n";
    print " Type: $type\n";
    print " XMLNS: $xmlns\n";
    print "===\n";
    #print $iq->GetXML(),"\n";
    #print "===\n";
}

# Print presence changes (joins, leaves, status updates).
sub InPresence {
    my $sid = shift;
    my $presence = shift;

    my $from = $presence->GetFrom();
    my $type = $presence->GetType();
    my $status = $presence->GetStatus();

    print "$from: $status\n";
}
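As a starting point for the exercise, here is a rough Python sketch of the archiving half. It assumes the log lines printed by the script above have already been collected (for example by redirecting its output to a file) and only builds the email. The function name build_digest and the addresses are hypothetical.

```python
from email.message import EmailMessage


def build_digest(log_lines, sender, recipients, room):
    """Bundle captured 'nick: text' lines into one plain-text email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Chat log for {room} ({len(log_lines)} lines)"
    # One chat line per row, in the order they were captured.
    msg.set_content("\n".join(log_lines) + "\n")
    return msg


# Example with made-up addresses and two captured lines.
digest = build_digest(
    ["alice: is the build green?", "bob: yes, as of an hour ago"],
    "chatlogger@example.com",
    ["alice@example.com", "bob@example.com"],
    "private-chat-example",
)
print(digest["Subject"])
```

Actually sending it would be a matter of handing the message to smtplib (its SMTP objects have a send_message method) from a daily cron job, with whatever mail server and credentials you have.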


Consumption versus Creation: Why I like the Padfone Tablet/Phone

I have been evaluating the numerous tablets (including the iPad obviously) and trying to figure out if I really need one. Since I already spend 8+ hours a day in front of a laptop/desktop, it seemed a little redundant to add another device into the family. The few times I have had to look up something while around the house and away from the computer, I just use the iPhone.

However, recently my wife noticed that I have been using the phone a little too much while not actively using my laptop. She wondered aloud if she should get me an iPad. However, when a book she wanted to read urgently did not show up from Amazon, I convinced her to get a nook color and download the book instead.

Of course, the moment she was done reading the book, the nook color was hacked and overclocked. Now I have a perfectly functioning 7 inch Android tablet with a gorgeous screen (1024×600 IPS). There are other tablets but not a $225 device with such a good screen. Android just runs off the SDcard too so I can just reboot to the stock nook experience without any hoops. I have tried booting into a Honeycomb version too but decided that it was too unstable for the kids though I loved it. (Here is the xda-developers thread for the adventurous).

While customizing the Android on nook color I realized how much I missed my Windows Mobile 6 phone where you could just change the xml to customize the home screen. On the iPhone you have to go through too many button presses to glance at the calendar or latest email. In WM6 (and in Android and other modern smartphones), you only need to look at the widget on the home screen to glance at latest email headers and calendar appointments. If the stock home screen doesn’t have it, no problem, there is a home screen app that will do it on Android. No such luck on the iPhone.

Once I went back and analyzed the patterns over a few days, I figured out my personal use-case for phones, tablets, and laptops. It is all about Content. In addition to using the phone for phone calls, the reason I use a smartphone is really for browsing to a few sites, checking up on news or letting the kids play a few games. Most of these happen at home or work (near a wifi zone) and almost always involve Consumption of content. This includes kids playing videos, music or games, which I see happening a lot in households with tablets. A larger screen would definitely make a difference for the couch browsing that most Apple ads demonstrate. However, multiple friends have confided that they have rarely done much more than type a few emails on their tablets in terms of actually authoring anything.

When Creating content (long emails, documents, or code), I find it much more convenient to use the laptop with a comfortable physical keyboard and familiar software. Of course, on a soft keyboard you can peck out a few emails with thumbs but never complicated explanations related to actual work. I was intrigued by the Atrix with its dock and wondered if it would make me rethink my usage of the tablet. Of course, there will always be people whose only content generation is emails and some PowerPoint, for which a phone with a keyboard or the laptop dock makes sense. But for too many people Atrix+dock is a more expensive combination that is trying to solve the wrong problem. It is trying to replace a netbook or the laptop (Windows or Mac), which is a much bigger hop.

I told friends a couple of weeks ago that the real killer combination for me would be a dumb large screen with a battery that I could slot my phone into and instantly have a Content Consumption screen that is larger but adds no new system to manage apps+data or pay for access costs. I couldn’t understand why no one else could see the obvious need 🙂 Maybe my use case was oddball?

Lo and behold, ASUS announces the Padfone. Hate the name (and the chintzy intro ad), love the concept! Now, if I could only find a way to wean myself off of my grandfathered unlimited iPhone data plan…

P.S. Why is it so difficult for the iPhone keyboard to show lowercase versus uppercase letters when capslock is on, instead of ONLY highlighting the capslock softkey? I only realized how annoying this omission is after I started using the Android keyboard.

Fault Tolerance: When it segfaults…

Note the following excerpts from the 2011 Amazon Web Services (AWS) outage explanation from Amazon:

When a node loses connectivity to a node to which it is replicating data to, it assumes the other node failed. To preserve durability, it must find a new node to which it can replicate its data (this is called re-mirroring). As part of the re-mirroring process, the EBS node searches its EBS cluster for another node with enough available server space, establishes connectivity with the server, and propagates the volume data. In a normally functioning cluster, finding a location for the new replica occurs in milliseconds.

Two factors caused the situation in this EBS cluster to degrade further during the early part of the event. First, the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly. There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror. This created more “stuck” volumes and added more requests to the re-mirroring storm.
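To make "back off aggressively" concrete, here is a generic sketch of capped exponential backoff with random jitter in Python. It is not Amazon's code or their fix; the function names and the base and cap values are assumptions chosen for illustration.

```python
import random


def backoff_ceiling(attempt, base=1.0, cap=30.0):
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    # Double the bound on every failure, but never exceed the cap.
    return min(cap, base * 2 ** attempt)


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Pick a random wait in [0, ceiling] so retries do not synchronize."""
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))


# Ceilings for the first six failed attempts, in seconds.
print([backoff_ceiling(a) for a in range(6)])
```

A node that waits like this between failed searches sends fewer and fewer requests as the cluster stays full, instead of hammering it at a constant rate, and the jitter keeps many nodes from retrying at the same instant.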

Here are excerpts from the AWS outage of 2008:

As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer’s request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn’t able to successfully process many customer requests.

We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
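The missing protection is easy to picture. Below is a small Python sketch of wrapping an internal state message with an MD5 digest so that a single flipped bit is caught on receipt. The message layout and the names seal and unseal are invented for illustration and say nothing about how S3 actually frames its gossip messages.

```python
import hashlib


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its 16-byte MD5 digest."""
    return hashlib.md5(payload).digest() + payload


def unseal(message: bytes) -> bytes:
    """Return the payload, or raise if it does not match its digest."""
    digest, payload = message[:16], message[16:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("state message corrupted in transit")
    return payload


good = seal(b"server-17: healthy")
print(unseal(good))
```

A receiver that rejects a message failing this check drops the bad state instead of gossiping it onward. MD5 is fine for catching accidental corruption like this, though not for defending against deliberate tampering.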

Now guess which service outage (and when) the following excerpts refer to:

The 4ESS switch used its new software to monitor its fellow switches as they recovered from faults. As other switches came back on line after recovery, they would send their “OK” signals to the switch. The switch would make a little note to that effect in its “status map,” recognizing that the fellow switch was back and ready to go, and should be sent some calls and put back to regular work.

Unfortunately, while it was busy bookkeeping with the status map, the tiny flaw in the brand-new software came into play.

But the switch had been programmed to monitor itself constantly for any possible damage to its data. When the switch perceived that its data had been somehow garbled, then it too would go down, for swift repairs to its software. It would signal its fellow switches not to send any more work. It would go into the fault recovery mode for four to six seconds. And then the switch would be fine again, and would send out its “OK, ready for work” signal.

However, the “OK, ready for work” signal was the very thing that had caused the switch to go down in the first place. And all the System 7 switches had the same flaw in their status-map software. As soon as they stopped to make the bookkeeping note that their fellow switch was “OK,” then they too would become vulnerable to the slight chance that two phone-calls would hit them within a hundredth of a second.

It only took four seconds for a switch to get well. There was no physical damage of any kind to the switches, after all. Physically, they were working perfectly. This situation was “only” a software problem. But the 4ESS switches were leaping up and down every four to six seconds, in a virulent spreading wave all over America, in utter, manic, mechanical stupidity. They kept knocking one another down with their contagious “OK” messages. It took about ten minutes for the chain reaction to cripple the network.

The last excerpt is from the AT&T telephone network crash of January 1990 as described in Bruce Sterling’s book, The Hacker Crackdown (I read it on the gopher in the early 90s).

Making hardware that is perfect and fault resistant is difficult and extremely expensive. The prevailing thought process is to assume hardware (processors, memory, disks, network, etc.) will fail and solve the problem in software. You design for it in your software:

Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

You add layers on top of it to account for faults in the fault tolerant architecture and you add layers to test the layers (Netflix Chaos Monkey):

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

Software fault tolerance is also difficult and expensive and has its own faults as the above examples show. With so many years of software engineering learning, why is it difficult?


During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.


And sure enough, within the week, a red-faced software company, DSC Communications Corporation of Plano, Texas, owned up to “glitches” in the “signal transfer point” software that DSC had designed for Bell Atlantic and Pacific Bell. The immediate cause of the July 1 Crash was a single mistyped character: one tiny typographical flaw in one single line of the software. One mistyped letter, in one single line, had deprived the nation’s capital of phone service. It was not particularly surprising that this tiny flaw had escaped attention: a typical System 7 station requires ten million lines of code.

Can you prevent an admin from pulling the wrong CAT6 wire or typing eth2 instead of eth1? Can you prevent a single character typo in ten million lines of code?

Power Management vs Low Power Design: Watts the difference?

I often get asked why the blog has zero content about my work. If I were to blog about work, it would/should be on our corporate blogspace. Search for it. However, I do get asked another question from friends outside work and the brief answer is useful for others.

Most of the people I talk to assume power management in chips is doing low power design.

Low Power design is typically about Watts: reducing the energy spent in doing the required computation or communication. Power Management is about optimizing Performance/Watt. Managing power can be achieved by reducing Watts using Low Power design techniques. However, Power Management also involves achieving the best performance for a given application space within the given power budget.

Power Management is not just about Power but about Performance!
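A toy Python example of the distinction, using made-up operating points (the performance units and wattages are hypothetical, not measurements of any chip): the low power view picks the point that burns the fewest watts, while the power management view picks the fastest point that still fits the budget.

```python
# Hypothetical operating points: (name, performance in arbitrary units, watts).
POINTS = [("low", 40.0, 2.0), ("mid", 70.0, 5.0), ("high", 100.0, 10.0)]


def lowest_power(points):
    """Low power view: minimize watts, whatever the performance."""
    return min(points, key=lambda p: p[2])


def best_within_budget(points, budget_watts):
    """Power management view: maximize performance inside the budget."""
    feasible = [p for p in points if p[2] <= budget_watts]
    return max(feasible, key=lambda p: p[1]) if feasible else None


def perf_per_watt(point):
    """Efficiency of one operating point."""
    return point[1] / point[2]


print(lowest_power(POINTS)[0], best_within_budget(POINTS, 6.0)[0])
```

With a 6 watt budget the two views disagree: the first picks "low" and the second picks "mid", even though "low" has the better performance per watt.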

DD-WRT Install: Pleasant Hacking

I was one of the early PlayOn paid users and upgraded to Premium recently. It really does work very well on wifi but not too well for video streaming on 3G. If you are interested in remote media streaming, you should definitely check it out. PlayOn and RemotePotato on my Media Center PC have opened up my media for access when I am out of town or when I am at a soccer game.

However, my trusty old Linksys WRT54G was showing its age. I had multiple services at home that I wanted to access behind my router, but the Linksys firmware does not allow any static allocation of IPs from the DHCP pool (aka DHCP Reservation). The first few search results showed that DD-WRT supports DHCP reservation. I subscribe to the “hack before you buy” philosophy, which is not a very family-friendly philosophy. However, a router is mostly invisible to the family so I was willing to try hacking it.

I was pleasantly surprised at how easy it was, without having to wade through a lot of documentation or innumerable steps.

  • Look up router version number
  • Download mini and full versions
  • Do a 30-30-30 reset
  • Flash with mini version
  • Do a 30-30-30 reset
  • Flash with full version
  • Do a 30-30-30 reset
  • Configure to personal taste

The fact that doing a 30-30-30 reset (hard reset) was the most difficult portion of the install had me questioning why I had not flashed it with DD-WRT years ago.

The web configuration interface is slick and as good as the Linksys original if not better. There are a lot more options to play with in DD-WRT compared to the Linksys firmware. Now all my services including my Pioneer VSX1120 are easily accessible from my dyndns mapped host. Isn’t open source wonderful?