Uno score keeper
March 31st, 2013 under Devel, OSS, rengolin, Software. [ Comments: none ]
With the spring not coming soon, we had to improvise during the Easter break and play Uno every night. It’s a lot of fun, but it can take quite a while to find a piece of clean paper and a pen that works around the house, so I wondered if there was an app for that. It turns out, there wasn’t!
There were several apps for keeping card game scores, but each one was specific to a single game, had ads and wanted access to the Internet, so I decided it was worth writing one myself. Plus, that would finally teach me to write Android apps, something I had been putting off for years.
The App
 Card Game Scores
The app is not just an Uno score keeper, it’s actually pretty generic. You just keep adding points until someone passes the threshold, at which point the poor soul is declared the winner or the loser, depending on how you set up the game. Since we’re playing every night, even the 30 seconds I spent re-typing our names was adding up, so I made it save the last game in the Android tuple store, so you can retrieve it via the “Last Game” button.
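The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the app's actual code: the class name `ScoreGame`, the `high_score_wins` flag and the choice of "greater than or equal" for passing the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ScoreGame:
    """Hypothetical model of a generic score keeper (not the app's real code)."""
    players: List[str]
    threshold: int
    # Assumption: False means whoever passes the threshold loses.
    high_score_wins: bool = False
    scores: Dict[str, int] = field(default_factory=dict, init=False)

    def __post_init__(self) -> None:
        # Every player starts a new game with zero points.
        self.scores = {name: 0 for name in self.players}

    def add_points(self, name: str, points: int) -> None:
        self.scores[name] += points

    def outcome(self) -> Optional[Tuple[str, str]]:
        """Return (player, 'winner' or 'loser') once someone passes the threshold."""
        name = max(self.scores, key=self.scores.get)
        if self.scores[name] < self.threshold:
            return None  # nobody has passed the threshold yet
        return name, "winner" if self.high_score_wins else "loser"


game = ScoreGame(["Ana", "Bob"], threshold=100)
game.add_points("Ana", 60)
game.add_points("Bob", 20)
game.add_points("Ana", 45)
print(game.outcome())
```

Starting a new round with the same players is then just a matter of building a new `ScoreGame` from the old `players` list.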
It’s also surprisingly easy to use (I had no idea), but if you go back and forth inside the app, it clears the game and starts a new one with the same players, so you can play as many rounds as you want. I might add a button to restart (or leave the app) when there’s a winner, though.
I’m also thinking about printing the names in order in the end (from victorious to loser), and some other small changes, but the way it is, is good enough to advertise and see what people think.
If you end up using it, please let me know!
Download and Source Code
The app is open source (GPL), so rest assured it has no tricks or money involved. Feel free to download it from here, and get the source code at GitHub.

Distributed Compilation on a Pandaboard Cluster
February 13th, 2013 under Devel, Distributed, OSS, rengolin. [ Comments: 2 ]
This week I was experimenting with distcc and Ninja on a Pandaboard cluster, and it behaved exactly as I expected, which is a good thing, but it might not be what I was looking for, which is not.
Long story short, our LLVM buildbots were running very slowly, taking from 3 to 4.5 hours to compile and test LLVM. If you consider that at peak time (PST hours) there are up to 10 commits in a single hour, the buildbot will end up testing 20-odd patches at the same time. If it breaks in unexpected ways, or if there is more than one patch in a given area, it might be hard to spot the guilty one.
We ended up just avoiding the make clean step, which put us at around 15 minutes for build+tests, with the odd chance of getting 1 or 2 hours tops, which is a great deal. But one of the alternatives I was investigating is a distributed build. All the more so because, with cluster nodes with dozens of ARM cores inside becoming available, we could make use of such a cluster to speed up our native testing, and even benchmark in a distributed way. If we do it often enough, the sample might be big enough to account for the differences.
The cluster
So, I got three Pandaboards ES (dual Cortex-A9, 1GB RAM each) and put the stock Ubuntu 12.04 on them and installed the bare minimum (vim, build-essential, python-dev, etc), upgraded to the latest packages and they were all set. Then, I needed to find the right tools to get a distributed build going.
It took a bit of searching, but I ended up with the following tool-set:
- distcc: The distributed build dispatcher, which knows about the other machines in the cluster and how to send them jobs and get the results back
- CMake: A Makefile generator which LLVM can use, and it’s much better than autoconf, but can also generate Ninja files!
- Ninja: The new intelligent builder which not only is faster at resolving dependencies, but also has a very easy way to change the rules to use distcc, and has a magical new feature called pools, which allows me to scale job types independently (compilers, linkers, etc).
All three tools had to be compiled from source. Distcc’s binary distribution for ARM is too old, CMake’s version on that Ubuntu couldn’t generate Ninja files and Ninja doesn’t have binary distributions, full stop. However, it was very simple to get them interoperating nicely (follow the instructions).
You don’t have to use CMake, there are other tools that generate Ninja files, but since LLVM uses CMake, I didn’t have to do anything. What you don’t want is to write the Ninja files yourself, it’s just not worth it. Unlike Make, Ninja doesn’t try to search for patterns and possibilities (this is why it’s fast), so you have to be very specific in the Ninja file about what you want to accomplish. This is very easy for a program to do (like CMake), but very hard and error prone for a human (like me).
Distcc
To use distcc is simple:
- Replace the compiler command by distcc compiler on your Ninja rules;
- Set the environment variable DISTCC_HOSTS to the list of IPs that will be the slaves (including localhost);
- Start the distcc daemon on all slaves (not on the master): distccd --daemon --allow <MasterIP>;
- Run ninja with the number of CPUs of all machines + 1 for each machine. Ex: ninja -j6 for 2 Pandaboards.
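The job-count rule in the last step (all CPUs, plus one extra job per machine) is easy to get wrong by hand, so here is a small sketch of it. The helper names and the IP addresses are made up for the example; the only facts taken from the text are the rule itself and the space-separated host list that DISTCC_HOSTS expects.

```python
from typing import Dict


def ninja_jobs(cores_per_host: Dict[str, int]) -> int:
    """Number of parallel jobs: every core on every machine, plus one per machine."""
    return sum(cores_per_host.values()) + len(cores_per_host)


def distcc_hosts(cores_per_host: Dict[str, int]) -> str:
    """Value for the DISTCC_HOSTS variable: a space-separated list of hosts."""
    return " ".join(cores_per_host)


# Hypothetical cluster: the master (localhost) plus one more dual-core Pandaboard.
cluster = {"localhost": 2, "192.168.0.11": 2}

print(f'DISTCC_HOSTS="{distcc_hosts(cluster)}" ninja -j{ninja_jobs(cluster)}')
```

For two dual-core boards this gives the -j6 used above, and adding a third board gives -j9.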
A local build, on a single Pandaboard of just LLVM (no Clang, no check-all) takes about 63 minutes. With distcc and 2 Pandas it took 62 minutes!
That’s better, but not as much as one would hope for, and the reason is a bit obvious, but no less damaging: the linker! It took 20 minutes to compile all of the code, and 40 minutes to link the objects into executables. That happened because, while we had 3 compilation jobs on each machine, we had 6 linking jobs on a single Panda!
See, distcc can spread the compilation jobs as long as it copies the objects back to the master, but because a linker needs all objects in memory to do the linking, it can’t do that over the network. What distcc could do, with Ninja’s help, is to know which objects will be linked together, and keep copies of them on different machines, so that you can link on separate machines, but that is not a trivial task, and relies on an interoperation level between the tools that they’re not designed to accept.
Ninja Pools
And that’s where Ninja proved to be worth its name: Ninja pools! In Ninja, pools are named resources that bundle jobs together under a specific concurrency limit. You can say that compilers scale freely, but that no more than a handful of linkers can run at once. You simply need to create a pool called linker_pool (or anything you want), give it a depth of, say, 2, and annotate all linking jobs with that pool. See the manual for more details.
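To make the pool idea concrete, the sketch below prints what such a declaration looks like in a Ninja file. The `pool` block with its `depth` and the `pool = ...` line on a rule are Ninja's own syntax; the rule name `link` and the `$cxx $in -o $out` command are placeholders for this example, since the rules CMake generates are named and shaped differently.

```python
def linker_pool_snippet(pool_name: str = "linker_pool", depth: int = 2) -> str:
    """Build a Ninja fragment that caps concurrent link jobs at `depth`."""
    return (
        f"pool {pool_name}\n"
        f"  depth = {depth}\n"
        "\n"
        # Placeholder rule: only the last line matters for the pool.
        "rule link\n"
        "  command = $cxx $in -o $out\n"
        f"  pool = {pool_name}\n"
    )


print(linker_pool_snippet())
```

Compile rules are left without a pool, so they keep scaling with the -j value while link jobs are limited to two at a time.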
With the pools enabled, a distcc build on 2 Pandaboards took exactly 40 minutes. That’s a gain of roughly 36% over the 63 minutes of the local build with double the resources, not bad. But how does that scale if we add more Pandas?
How does it scale?
To get a third point (and be able to apply a curve fit), I’ve added another Panda and ran again, with 9 jobs and linker pool at 2, and it finished in 30 minutes. That’s less than half the time with three times more resources. As expected, it’s flattening out, but how much more can we add to be profitable?
I don’t have an infinite number of Pandas (nor do I want to spend all my time on it), so I just cheated and got a curve fitting program (xcrvfit, in case you’re wondering), cooked up an exponential that was close enough to the points and used the software’s ability to do a best fit. It came out with 86.806*exp(-0.58505*x) + 14.229, which according to Lybniz, flattens out after 4 boards (about 20 minutes).
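The fitted curve can be checked against the three measured points with a few lines of Python. The coefficients and the measurements (63, 40 and 30 minutes) come from the text above; the function name and the range of boards printed are just choices for this sketch.

```python
import math

# Measured build times in minutes, keyed by number of Pandaboards.
MEASURED = {1: 63, 2: 40, 3: 30}


def fitted_minutes(boards: float) -> float:
    """Best-fit exponential quoted above: 86.806*exp(-0.58505*x) + 14.229."""
    return 86.806 * math.exp(-0.58505 * boards) + 14.229


for boards in range(1, 9):
    measured = MEASURED.get(boards)
    note = f"(measured: {measured})" if measured is not None else ""
    print(f"{boards} boards: {fitted_minutes(boards):5.1f} min {note}")
```

The constant term, 14.229, is the floor the curve tends to: under this model no number of extra boards would bring the build below roughly 14 minutes.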
Pump Mode
Distcc has a special mode called pump mode, in which it pushes, along with the C file, all the headers necessary to compile it solely on the node. Normally, distcc will preprocess on the master node and send the preprocessed result to the slaves, which convert it to object code. According to the manual, this could improve the performance 10-fold! Well, my results were a little less impressive. Actually, my 3-Panda cluster finished in just about 34 minutes, 4 minutes more than without pump mode, which is puzzling.
I could clearly see that the files were being compiled on the slaves (distccmon-text would tell me that, whereas before there were a lot of “preprocessing” jobs on the master), but Ninja doesn’t print times on each output line for me to guess what could have slowed it down. I don’t think there was any effect on the linker pool, which was still enabled in this mode.
Conclusion
Simply put, both distcc and Ninja pools have proven to be worthy tools. On slow hardware, such as the Pandas, distributed builds can be an option, as long as you have a good balance between compilation and linking. Ninja could be improved to help distcc link on remote nodes as well, but that’s a wish I would not press on the team.
However, scaling only to 4 boards reduces a lot of the value for me, since I was expecting to use 16/32 cores. The main problem is again the linker jobs working solely on the master node, and LLVM having lots and lots of libraries and binaries. Ninja’s pools can also work well when compiling LLVM+Clang in debug mode, since the objects are many times bigger, and even on an above-average machine you can start swapping or even freeze your machine if you are using other GUI programs (browsers, editors, etc).
In a nutshell, the technology is great and works as advertised, but with LLVM it might not be the right fit yet. It’s still more profitable to get faster hardware, like the Chromebooks, which are 3x faster than the Pandas and cost only marginally more.
It would also be good to know why pump mode has regressed in performance, but I have no more time to spend on this, so I leave it as an exercise to the reader.

LLVM Vectorizer
February 12th, 2013 under Algorithms, Devel, rengolin. [ Comments: 2 ]
Now that I’m back working full-time with LLVM, it’s time to get some numbers about performance on ARM.
I’ve been digging the new LLVM loop vectorizer and I have to say, I’m impressed. The code is well structured, extensible and above all, sensible. There is lots of room for improvement, and the code is simple enough that you can do it without destroying the rest or having to re-design everything.
The main idea is that the loop vectorizer is a Loop Pass, which means that if you register this pass (automatically on -O3, or with -loop-vectorize option), the Pass Manager will run its runOnLoop(Loop*) function on every loop it finds.
The three main components are:
- The Loop Vectorization Legality: Basically identifies if it’s legal (not just possible) to vectorize. This includes checking if we’re dealing with an inner loop, and if it’s big enough to be worth it, and making sure there aren’t any conditions that forbid vectorization, such as overlaps between reads and writes or instructions that don’t have a vector counterpart on a specific architecture. If nothing is found to be wrong, we proceed to the second phase:
- The Loop Vectorization Cost Model: This step will evaluate both versions of the code: scalar and vector. Since each architecture has its own vector model, it’s not possible to create a common model for all platforms, and in most cases, it’s the special behaviour that makes vectorization profitable (like 256-bit operations in AVX), so we need a bunch of cost model tables that we consult given an instruction and the types involved. Also, this model doesn’t know how the compiler will lower the scalar or vectorized instructions, so it’s mostly guess-work. If the vector cost (normalized to the vector size) is less than the scalar cost, we do:
- The Loop Vectorization: Which is the proper vectorization, i.e. walking through the scalar basic blocks, changing the induction range and increment, creating the prologue and epilogue, promoting all types to vector types and changing all instructions to vector instructions, taking care to leave the interaction with the scalar registers intact. This last part is a dangerous one, since we can end up creating a lot of copies from scalar to vector registers, which is quite expensive and was not accounted for in the cost model (remember, the cost model is guess-work based).
All that happens on a new loop place-holder, and if all is well at the end, we replace the original basic blocks by the new vectorized ones.
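The three-stage decision above can be mimicked with a toy model. To be clear, this is not LLVM's code or API: the `ToyLoop` fields, the minimum trip count and the cost numbers are all invented for illustration, and only the shape of the decision (legality first, then comparing the scalar cost against the vector cost normalized by the vector width) follows the description.

```python
from dataclasses import dataclass
from typing import Dict

MIN_TRIP_COUNT = 16  # arbitrary "big enough to be worth it" threshold for this sketch


@dataclass
class ToyLoop:
    is_inner: bool
    trip_count: int
    has_memory_overlap: bool
    scalar_cost: float              # made-up cost of one scalar iteration
    vector_costs: Dict[int, float]  # vector width -> cost of one vector iteration


def is_legal(loop: ToyLoop) -> bool:
    """Stage 1: refuse anything that is not a big, overlap-free inner loop."""
    return (loop.is_inner
            and loop.trip_count >= MIN_TRIP_COUNT
            and not loop.has_memory_overlap)


def best_width(loop: ToyLoop) -> int:
    """Stage 2: pick the width with the lowest cost per original iteration."""
    width, cost = 1, loop.scalar_cost
    for candidate, vector_cost in loop.vector_costs.items():
        normalized = vector_cost / candidate  # normalize to the vector size
        if normalized < cost:
            width, cost = candidate, normalized
    return width


def decide(loop: ToyLoop) -> str:
    """Stage 3 would rewrite the loop; here we only report the decision."""
    if not is_legal(loop):
        return "scalar (not legal)"
    width = best_width(loop)
    return f"vectorized x{width}" if width > 1 else "scalar (not profitable)"


print(decide(ToyLoop(True, 1024, False, scalar_cost=4, vector_costs={2: 6, 4: 8})))
```

A real cost model looks up per-instruction, per-type tables for each target instead of taking a single number per loop, which is where the guess-work mentioned above comes in.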
So, the question is, how good is this? Well, depending on the problems we’re dealing with, vectorizers can considerably speed up execution. Especially iterative algorithms, with lots of loops, like matrix manipulation, linear algebra, cryptography, compression, etc. In more practical terms, anything to do with encoding and decoding media (watching or recording videos, pictures, audio), Internet telephones (compression and encryption of audio and video), and all kinds of scientific computing.
One important benchmark for that kind of workload is Linpack. Not only does Linpack have many examples of loops waiting to be vectorized, but it’s also the benchmark that defines the Top500 list, which ranks the fastest computers in the world.
Benchmarks
So, both GCC and Clang now have the vectorizers turned on by default with -O3, so comparing them is as simple as compiling the programs and see them fly. But, since I’m also interested in seeing what is the performance gain with just the LLVM vectorizer, I also disabled it and ran a clang with only -O3, no vectorizer.
On x86_64 Intel (Core i7-3632QM), I got these results:
| Compiler | Opt | Avg. MFLOPS | Diff |
| Clang | -O3 | 2413 | 0.0% |
| GCC | -O3 vectorize | 2421 | 0.3% |
| Clang | -O3 vectorize | 3346 | 38.6% |
This is some statement! The GCC vectorizer has existed for a lot longer than LLVM’s and has been developed by many vectorization gurus, and yet LLVM seems to easily beat GCC in that field. But, a word of warning: Linpack is by no means representative of all use cases and user-visible behaviour, and it’s very likely that GCC will beat LLVM in most other cases. Still, a reason to celebrate, I think.
This boost means that, in many cases, not only are the transformations legal and correct (or Linpack would have produced wrong results), but they also generate faster code at no discernible cost. Of course, the theoretical limit is around a 4x boost (if you manage to replace every single scalar instruction with a vector one and the CPU behaves the same regarding branch prediction, cache, etc.), so one could expect a slightly higher number, something on the order of 2x better.
It depends on the computation density we’re talking about. Linpack tests specifically the inner loops of matrix manipulation, so I’d expect a much higher ratio of improvement, something around 3x or even closer to 4x. VoIP calls, watching films and listening to MP3s are also good examples of densely packed computation, but since we’re usually running those applications on a multi-tasking operating system, you’ll rarely see improvements higher than 2x. But general applications rarely spend that much time in inner loops (they are mostly waiting for user input and then doing a bunch of unrelated operations, hardly vectorizable).
Another important aspect of vectorization is that it saves a lot of battery juice. For MP3 decoding, it doesn’t really matter whether you finish in 10 or 5 seconds, as long as the music doesn’t stop to buffer. But taking 5 seconds instead of 10 means that for the other 5 seconds the CPU can reduce its voltage and save battery. This is especially important in mobile devices.
What about ARM code?
Now that we know the vectorizer works well, and the cost model is reasonably accurate, how does it compare on ARM CPUs?
It seems that the grass is not so green on this side, at least not at the moment. I have reports that on ARM it also reached the 40% boost similar to Intel, but what I saw was a different picture altogether.
On a Samsung Chromebook (Cortex-A15) I got:
| Compiler | Opt | Avg. MFLOPS | Diff |
| Clang | -O3 | 796 | 0.0% |
| GCC | -O3 vectorize | 736 | -8.5% |
| Clang | -O3 vectorize | 773 | -2.9% |
The performance regression can be explained by the amount of scalar code intermixed with vector code inside the inner loops as a result of shuffles (movement of data within the vector registers and between scalar and vector registers) not being lowered correctly. This most likely happens because the LLVM back-end relies a lot on pattern-matching for instruction selection (a good thing), but the vectorizers might not be producing the shuffles in the right pattern, as expected by each back-end.
This can be fixed by tweaking the cost model to penalize shuffles, but it’d be good to see if those shuffles aren’t just mismatched against the patterns that the back-end is expecting. We will investigate and report back.
Update
Got results for single precision floating point, which show a greater improvement on both Intel and ARM.
On x86_64 Intel (Core i7-3632QM), I got these results:
| Compiler | Opt | Avg. MFLOPS | Diff |
| Clang | -O3 | 2530 | 0.0% |
| GCC | -O3 vectorize | 3484 | 37.7% |
| Clang | -O3 vectorize | 3996 | 57.9% |
On a Samsung Chromebook (Cortex-A15) I got:
| Compiler | Opt | Avg. MFLOPS | Diff |
| Clang | -O3 | 867 | 0.0% |
| GCC | -O3 vectorize | 788 | -9.1% |
| Clang | -O3 vectorize | 1324 | 52.7% |
Which goes to show that the vectorizer is, indeed, working well for ARM on single precision, but for double precision the costs of using the VFP/NEON pipeline outweigh the benefits. Remember that NEON vectors are only 128 bits wide and VFP only 64 bits wide, and NEON has no double precision floating point operations, so they’ll only do one double precision floating point operation per cycle, so the theoretical maximum depends on the speed of the soft-fp libraries.
So, in the future, what we need to be working on is the cost model, to make sure we don’t regress in performance, and try to get better algorithms when lowering vector code (both by making sure we match the patterns that the back-end is expecting, and by just finding better ways of vectorizing the same loops).
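For reference, the Diff column in the single precision tables is simply the relative change against the non-vectorized Clang -O3 baseline. A minimal sketch of that arithmetic, using the MFLOPS figures reported in the update (the function and variable names are mine):

```python
def diff_percent(mflops: float, baseline: float) -> float:
    """Relative change of a result against the baseline, in percent."""
    return (mflops - baseline) / baseline * 100.0


# Single precision results from the update: (baseline, {configuration: MFLOPS}).
results = {
    "Intel Core i7-3632QM": (2530, {"GCC -O3 vectorize": 3484,
                                    "Clang -O3 vectorize": 3996}),
    "Cortex-A15 Chromebook": (867, {"GCC -O3 vectorize": 788,
                                    "Clang -O3 vectorize": 1324}),
}

for machine, (baseline, runs) in results.items():
    for config, mflops in runs.items():
        print(f"{machine}: {config}: {diff_percent(mflops, baseline):+.1f}%")
```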
Conclusion
Without further benchmarks it’s hard to come to a final conclusion, but it’s looking good, that’s for sure. Since Linpack is part of the standard LLVM test-suite benchmarks, fixing this and running it regularly on ARM will at least avoid any further regressions… Now it’s time to get our hands dirty!

Hypocrite Internet Freedom
December 11th, 2012 under Digital Rights, Politics, rengolin, Web, World. [ Comments: none ]
Last year, the Internet showed its power over governments, when we all opposed the SOPA and PIPA legislation in protests across the world, including on this very blog. Later on came the protests against ACTA and so on, and we all felt very powerful indeed. Now a new threat looms over the Internet: the ITU is trying to take it over.
To quote Ars Technica:
Some of the world’s most authoritarian regimes introduced a new proposal at the World Conference on International Telecommunications on Friday that could dramatically extend the jurisdiction of the International Telecommunication Union over the Internet.
Or New Scientist:
This week, 2000 people have gathered for the World Conference on International Telecommunications (WCIT) in Dubai in the United Arab Emirates to discuss, in part, whether they should be in charge.
And stressing that:
WHO runs the internet? For the past 30 years, pretty much no one.
When in reality, the Internet of today is actually in the precise state the US is trying to avoid, only that now they’re in control, and the ITU is trying to change it to an international organization, where more countries have a say.
Today, the DNS and the main IP blocks are controlled by ICANN. However, Ars Technica reminds us that ICANN and IANA are:
the quasi-private organizations that currently oversee the allocation of domain names and IP addresses.
But ICANN was once a US government operated body and still has strong ties with Washington, being located solely on US soil and operating under US jurisdiction. It has also failed on many occasions to democratize its operations, leaving international input with little or no impact. Furthermore, all top level domains that are not bound to a country (like .com, .org, .net) are also within American jurisdiction, even if they’re hosted and registered in another country.
But controlling the DNS is only half the story. The control that the US has over the Internet is much more powerful. First, they hold (for historical and economic reasons) most of the backbone of the Internet (root DNS servers, core routers, etc). That means the traffic between Europe and Japan will probably pass through them. In theory, this shouldn’t matter and it’s actually an optimization of the self-structuring routing tables, but in fact, the US government has openly reported that they do indeed monitor all traffic that goes through their borders and they do reserve the right to cut it, if they think it presents a risk to national security.
Given the amount of publicity the TSA had since 2001 for their recognition of what poses a security threat, including Twitter comments from British citizens, I wouldn’t trust them, or their automated detection system to care for my security. Also, given the intrusion that they have on some governments like the case of Dotcom in January, where national security operations in New Zealand were shared inappropriately with the American government, I never felt safe when crossing American soil, physically or through the Internet.
Besides, Hollywood has shown in Scandinavia and in the UK that it holds European governments on a tight leash when it comes to (US) copyright laws, forcing once liberal governments to abide by American rules and arrest their own citizens when content is being distributed over the Internet. It’s also interesting to remember that SOPA, PIPA and ACTA, mainly driven by Hollywood, were all created behind closed doors.
So, would ITU control be better?
No. Nothing could be further from the truth. Although, in theory, it’s more democratic (more countries with decision power), this decision power has been sought for one main purpose: to enforce more strict laws. I generally agree that the ITU would not be a good controlling body, but believing that nobody controls the Internet is, at least, naive, and normally a pretentious lie.
Legal control by many countries over something as free as the Internet would pose the same dangers as having it free of legal control, since it leaves us with indirect control by the strongest player, which so far has been the US. The other countries are only so strongly minded about the ITU because the US won’t let them have their voices, and the ITU is a way to create a UN for the Internet.
In that sense, the ITU would be a lot like the UN. Worthless. A puppet in the hands of the strong players. Each country would have more control over its borders, and that would impact almost nothing in the US, but the general rules would stop being valid, and the US (and other countries) would have to do a lot more work than they do today. One example is the stupid rule in the UK where sites, including international ones, have to warn users that they are using cookies.
Don’t be fooled, the US government is not really worried about your safety and security, nor your freedom. They’re trying to avoid a lot of work, and a big loss in market in the Middle East and South Asia. With countries (that they like to say are authoritarian regimes) imposing stricter rules on traffic, including fees, taxes and other things that they have on material goods, the commerce with those governments will be a lot more expensive.
Ever since the Second World War, the US economy has been based mainly on military activities. First, helping Europe got them out of the Great Depression, then they forced rebellions throughout Latin America to keep the coins clinking and currently, it’s the Middle East. With climate change endangering their last non-war resource (oil), they were betting on the Internet to spread the American Way Of Life to the less fortunate, with the off chance of selling a few iPads in the process, but now, that profit margin is getting dangerously thin.
Not to mention the military threat, since a lot of the intelligence is now being gathered through the Internet, and recent attacks on Iranian nuclear power plants via the Stuxnet worm, would all become a lot harder. The fact that China is now bigger and more powerful than they are, in every possible aspect (I dare say even military, but we can’t know for sure), is also not helping.
What is then, the solution? Is it possible to really have nobody running the Internet? And, if at all possible, is it desirable?
Mad Max Internet
I don’t think so.
It’s true that IPv6 should completely remove the need for IP allocation, but DNS is a serious problem. Leaving DNS registration to an organic, self-organized process would lead to widespread malicious content being distributed, and building security measures around it would be much harder than it already is. The same is true of SSL certificates. You’d expect that, in a land with no rules, trusted bodies would charge a fortune and extort clients for a safe SSL certificate, if they actually produce a good one, that is, but this is exactly what happens today, under ICANN rule.
Routing would also be affected, since current algorithms rely on total trust between parties. There was a time when China had all US traffic (including governmental and military) going through its routers, solely via standard BGP rules. In a world where every country has its own core router, digitally attacking another country would be as easy as changing one line on a router.
We all love to think that the Internet is a free world already, but more often than ever, people are being arrested for their electronic behaviour. Unfortunately, because there isn’t a set of rules, or a governing body, the rules that get people arrested are the rules of the strongest player, which in our current case, is Hollywood. So, how is it possible to reconcile security, anonymity and stability without resorting to governing bodies?
The simple answer is, it’s not. The Internet is a land with no physical barriers, where contacting people thousands of miles away is the same as contacting the one beside you, but we don’t live in a world without borders. It’s not possible to reconcile the laws of all countries, with all the different cultures, into one single book. As long as the world keeps its multiculturalism, we have to cope with different rules for different countries, and I’m not in favour of losing our identity just to make the Internet a place comfortable to the US government.
Regulating multi-body
It is my opinion that we do, indeed, need a regulating body. ICANN, ITU, it doesn’t matter, as long as the decisions are good for most.
I don’t expect that any such governing body would come up with a set of rules that are good for everybody, nor that they’ll find the best rules in the first N iterations (for large N), but if the process is fair, we should reach consensus (when N tends to infinity). The problem with both ICANN and ITU is that neither is fair, and there are other interests at play that are weighted much more heavily than the interests of the people.
Since no regulating body, governmental or not, will ever account for the interests of the people (today or ever), people tend to hope that no-rule is the best rule, but I hope I have shown that this is not true. I believe that instead, a governing multi-body is the real solution. It’s hypocritical to believe that Russia will let the US create regulations within its borders, so we can’t assume that from the start if we want it to work in the long run. So this multi-body, composed of independent organizations in Europe, Asia, Oceania, Africa and the Americas, would have strong powers in their regions, but would have to agree on very general terms.
The general terms would be something like:
- There should be no cost associated with the traffic to/from/across any country to any other country
- There should be no filtering of any content across countries, but filtering should be possible to/from a specific country or region based on religious or legal grounds
- It should be possible for countries to deny certain types of traffic (as opposed to filtering above), so that routing around would be preferred
- Misuse of Internet protocols (such as BGP and DNS spoofing) on root routers/DNS servers should be considered an international crime, with the country responsible for the server in charge of the punishments; otherwise, sanctions against that country could be enforced by the UN
- Legal rights and responsibilities on the Internet should be similar (but not identical) to what they are in the physical world, but each country has the right and duty to enforce its own rules
Rule 1 is fundamental and would cut short most of the recent ITU’s proposals. It’s utter nonsense to cross-charge the Internet as it is to do it with telecoms around the world, and that is probably the biggest problem of the new proposal.
Rules 2 and 3 would leave control over the regional Internet in local hands, with little impact on the rest. They’d also encourage the creation of new routes around problematic countries, which is always beneficial to the reliability of the Internet as a whole. It’s hypocritical to assume that the US government has the right to impose Internet rules on countries like Iran or China, and it’s up to the people of China and Iran to fight their leaders on their own terms.
It’s extremely hypocritical, and very common, in the US to believe that their system (the American Way of Life) is the best for every citizen of the world, or that the people of other countries have no way of choosing their own history. It’s also extremely hypocritical to blame authoritarian governments over Internet regulations and at the same time provide weapons to and support local authoritarian groups. Let’s not forget the role of the US in Afghanistan and Iraq prior to the Gulf War, in opposition to Russia and Iran (respectively), and their pivotal role in all the major authoritarian revolutions in Latin America.
Most countries, including Russia and the ones in the Middle East, would probably be fine with rules 2 and 3, with little impact on the rest of the world. Which leaves us with rule 4, to account for the trustworthiness of the whole system. Today, there is a gang of a few pals who control the main routers, and giving less trustworthy pals more control over DNS and BGP routes would indeed be a problem.
However, this rule is in fact already in force today, since China routed US traffic for only 18 minutes. It was more a show of power than a real attack, but had China been doing this for too long, the US would have thought otherwise, and with very strong reasons. The loose control is good, but the loose responsibility is not. Countries should have the freedom to structure their Internet backbones, but they should also do it responsibly, or be punished otherwise.
Finally, there’s rule 5. How to account when a citizen of one country behaves in another country’s website as it’s legal for his culture, but not the other? Strong religious and ethical issues will arise from that, but nothing that there isn’t already on the Internet. Most of the time, this problem is identical to what already happens on the real world, with people from one country that commit crimes on another country. The hard bit is to know what are the differences between physical and logical worlds and how to reconcile the differences in interpretation of the multiple groups that will take part on such governing multi-body.
Conclusion
ITU’s proposal is not good, but neither is ICANN’s. The third alternative, a complete lack of control, is only going to make it worse, so we need a solution that is both viable and general enough, so that most countries agree to it. It also needs to relinquish control of internal features to each country’s own government in a way that does not affect the rest of the Internet.
I argue that one single body, be it ITU or ICANN, is not a good model, since it’s not general enough, nor does it account for specific regions’ concerns (ICANN won’t listen to the Middle East and ITU won’t regard the US). So, the only solution I can see as possible is one that unites them all into a governing multi-body, with very little in global agreement, but with general rules powerful enough to guarantee that the Internet will be free forever.
The American constitution is a beautiful piece of writing, but in reality, over the years, their government has destroyed most of its beauty. So, long term self-check must also be a core part of this multi-body, with regular review and democratic decisions (sorry authoritarian regimes, it’s the only way).
In a nutshell, while it is possible to write the Internet Constitution and make it work in the long term, humanity is very likely not ready to do that yet, and we’ll probably see the destruction of the Internet in the next 10 years.
Sigh…
|
|
Open Source and Innovation |
| September 13th, 2012 under Corporate, OSS, rengolin, Technology. [ Comments: 1 ]
|
|
A few weeks ago, a friend (Rob) asked me a pertinent question: “How can someone innovate and protect her innovation with open source?”. Initially, I brushed it off with a simple “well, you know…”, but this turned out to be a really hard question to answer.
The main idea is that, in the end, all software (and possibly hardware) will end up as open source. Not because it’s beautiful and fluffy, but because it seems to be the natural course of things nowadays. We seem to be moving from profiting from products to giving them away and profiting from services. If that’s true, are we going to stop innovating at all, and just focus on services? What about the real scientists that move the world forward, are they also going to be flipping burgers?
Open Source as a business model
The reason to use open source is clear: the TCO fallacy is gone and we’re all used to it (especially the lawyers!). That’s all good, but the question is really what (or even when) to open source your own stuff. Some companies do it because they want to sell the value added, or plugins and services. Others do it because it’s not their core business or because they want to form a community, which would otherwise use the competitors’ open source solution. Whatever the reason is, we seem to be open sourcing software and hardware at an increasing speed, and sometimes it is open source from its first day in the wild.
Open source is a very good cost-sharing model. Companies can develop a third-party product, not related to their core areas (where they actually make money), and still claim no responsibility or ownership (which would be costly). For example, the GNU/Linux and FreeBSD operating systems tremendously reduce costs for any application developer, from embedded systems to big distributed platforms. Most platforms today (Apple’s, Android, set-top boxes, sat-navs, HPC clusters, web servers, routers, etc.) have them at their core. If each of these products had to develop its own operating system (or even parts of it), it wouldn’t be commercially viable.
Another example is the MeshPotato box (in Puerto Rico), which uses open software and hardware initially developed by Village Telco (in South Africa). They can cover wide areas, providing internet and VoIP telephony over the rugged terrain of Puerto Rico for under $30 a month. If they had to develop their own hardware and software (including the OS), it’d cost no less than a few hundred pounds. Examples like that are abundant these days and it’s hard to ignore the benefits of Open Source. Even Microsoft, once the biggest closed-source zealot, who propagated the misinformation that open source was hurting the American Way of Life, is now one of the biggest open source contributors on the planet.
So, what is the question then?
If open source saves money everywhere, and promotes incremental innovation that wouldn’t be otherwise possible, how can the original question not have been answered? The key was in the scope.
Rob was referring, in fact, to real chunky innovations. Those that take years to develop, with many people working hard with one goal in mind, spending their last penny to possibly profit in the end. The true sense of entrepreneurship. Things that might profit from other open source technologies, but are so hard to make that, even so, they take years to produce. Things like new chips, new medicines, real artificial intelligence software and hardware, etc. The open source savings on those projects are marginal. Furthermore, if you spend 10 years developing a piece of software (or hardware) and open source it straight away, how are you ever going to get your investment money back? Unless you charge $500 a month in services to thousands of customers on day one, you won’t see the money back in decades.
The big misunderstanding, I think, is that this model no longer applies, so the initial question was invalid to begin with. Let me explain.
Science and Technology
300 years ago, if you were curious about something, you could make a name for yourself very easily. You could barely call what they did science. They even called themselves natural philosophers, because what they did was mostly discovering nature and inquiring about its behaviour. Robert Hooke was a natural philosopher and a polymath; he kept dogs with their internals in the open just to see if they’d survive. He’d keep looking at things through a microscope, and he named most of the small things we can see today.
Newton, Leibniz, Gauss, Euler and a few others created the whole foundation of modern mathematics. They are known for fundamentally changing how we perceive the universe. It’d be preposterous to assume that there isn’t a person today as bright as they were, and yet we don’t see people changing our perception of the universe that often. The last spree was more than a hundred years ago, with Maxwell, Planck and Einstein, but still, they were corrections (albeit fundamental) to the model.
Today, a scientist is content with scratching the surface of a minor field in astrophysics, and he’ll probably get a Nobel for that. But how many of you can name more than 5 Nobel laureates? Did they really change your perception of the universe? Did they invent things such as real artificial intelligence or did they discover a better way of doing politics? Sadly, no. Not because they weren’t as smart as Newton or Leibniz, but because the easy things were already discovered. Now we’re in for the hard and incremental science and, like it or not, there’s no way around it.
Today, if you wrapped tin foil around a toilet paper tube and played music with it, people would, at best, think you’re cute. Thomas Edison did that and was called a Wizard. Nokia was trying to build a smartphone, but they were trying to make it perfect. Steve Jobs made it almost useless, people loved it, and he’s now considered a genius. If you try to produce a bad phone today, people will laugh at you, not think you’re cute, so things are getting harder for the careless innovators, and that’s the crucial point. Careless and accidental innovation is not possible in any field that has been exploited long enough.
Innovation and Business
Innovation is like business: you only profit if there is a market that hasn’t been taken. If you try to invent a new PC, you will fail. But if you produce a computer that fills a niche that has never been exploited (even if it’s a known market, as in Nokia’s smartphone case), you’re in for the money. If you want to build the next AI software, and it marginally works, you can make a lot of money, whether you open source your software or not. Since people will copy (copyright and patent laws are not the same in every country), your profit will diminish with time, in proportion to the novelty and the difficulty in copying.
Rob’s point went further: “This isn’t just a matter of what people can or can’t do, it’s what people should or should not do”. Meaning, shouldn’t we aim for a world where people don’t copy other people’s ideas as a principle, instead of accepting the fact that people copy? My answer is a strong and resounding: NO! For the love of all that’s good, NO!
The first reason is simply that that’s not the world we live in, and it will not be as long as humanity remains human. There is no point in creating laws that do not apply to the human race, though it seems that people get away with that very easily these days.
The second point is that it breaks our society. An example: try to get into a bank and ask for investment in a project that will take 10 years to complete (at the cost of $10M) with the return coming during the 70 years that follow (at a profit of hundreds of millions of dollars a year). The manager will laugh at you and call security. This is, however, the time it takes (today) for copyright in Hollywood to expire (the infamous Mickey Mouse effect), and the kind of money they deal with.
Imagine that a car manufacturer develops a much safer way of building cars, say magical air bags. This company will be able to charge a premium, not just because of the development costs, but also for its unique position in the market. With time, it’ll save more lives than any other car and governments will want that to be standard. But no other company can apply that to their cars, or at least not without paying a huge premium to the original developer. In the end, cars will be much more expensive in general, and we end up paying the price.
Imagine if there were patents for the telephone, or the TV or cars (I mean, the concept of a car) or “talking to another person over the phone”, or “reminding to call your parents once in a while”. It may look silly, but this is better than most patent descriptions! Most of the cost to the consumer would be patents to people that no longer innovate! Did you know that Microsoft makes more money with Android phones than Google? Their contributions to the platform? Nothing. This was an agreement over dubious and silly patents that most companies accepted as opposed to being sued for billions of dollars.
Conclusion
In my opinion, we can’t just live in the 16th century with 21st century technology. You can’t expect to be famous or profit by building an in-house piece of junk or by spotting a new planet. Open source has nothing to do with it. The problem is not what you do with your code, but how you approach the market.
I don’t want to profit at the expense of others, I don’t want to protect my stupid idea that anyone else could have had (or probably already had, but thought it was silly), just because I was smart enough to market it. Difficult technology is difficult (duh), and it’s not up to a team of experts to create it and market it to make money. Science and technology will advance from now on in a steady, baby-steps way, and the tendency is for this pace to get even slower and smaller.
Another important conclusion for me is that I’d rather live in a world where I cannot profit horrendously from a silly idea just because I’ve patented it than have monopolies like pharma/banking/tobacco/oil/media controlling our governments or, more directly, our lives. I think that the fact that we copy and destroy property is the most liberating fact of humanity. It’s the Robin Hood of modern societies, making sure that, one way or another, the filthy rich won’t continue getting richer. Explosive growth, monopolies, cartels, free trade and protection of property are core values that I’d rather see dead as a parrot.
In a nutshell, open source does not hinder innovation, protection of property does.
|
|
Declaration of Internet Freedom |
| July 3rd, 2012 under Digital Rights, Life, Media, Politics, rengolin, rvincoletto, World. [ Comments: 1 ]
|
|
We stand for a free and open Internet.
We support transparent and participatory processes for making Internet policy and the establishment of five basic principles:
- Expression: Don’t censor the Internet.
- Access: Promote universal access to fast and affordable networks.
- Openness: Keep the Internet an open network where everyone is free to connect, communicate, write, read, watch, speak, listen, learn, create and innovate.
- Innovation: Protect the freedom to innovate and create without permission. Don’t block new technologies, and don’t punish innovators for their users’ actions.
- Privacy: Protect privacy and defend everyone’s ability to control how their data and devices are used.
Don’t get it? You should be more informed on the power of the internet and what governments around the world have been doing to it.
Good starting places are: Avaaz, Ars Technica, Electronic Frontier Foundation, End Software Patents, Piratpartiet and the excellent Case for Copyright Reform.
Source: http://www.internetdeclaration.org/freedom
|
|
K-means clustering |
| June 20th, 2012 under Algorithms, Devel, rengolin. [ Comments: none ]
|
|

Clustering algorithms can be used with many types of data, as long as you have the means to distribute them in a space where there is the concept of distance. Vectors are obvious choices, but not everything can be represented as N-dimensional points. Another way to plot data, one that is much closer to real data, is to allow for a large number of binary axes, like tags. So, you can cluster by the number of tags the entries share, with the distance being (only relative to others) the proportion of these against the non-shared tags.
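A minimal sketch of such a tag distance, for illustration only. It assumes the Jaccard distance (shared tags over all tags seen in either entry), which is one common way to turn "proportion of shared against non-shared tags" into a number; the post does not say which formula was intended, and the function name `tag_distance` is hypothetical.

```python
def tag_distance(tags_a, tags_b):
    """Jaccard distance between two tag collections.

    0.0 means the entries carry exactly the same tags,
    1.0 means they share none.
    """
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        # Assumption: two entries without any tags count as identical.
        return 0.0
    shared = a & b
    return 1.0 - len(shared) / len(union)


if __name__ == "__main__":
    print(tag_distance({"linux", "arm", "compilers"}, {"arm", "compilers", "llvm"}))
```

The value only makes sense relative to other pairs of entries: a smaller number means the two entries are closer in tag space.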
An example of tag clustering can be viewed on Google News, while an example of clustering in Euclidean space can be viewed in the image above (full code here). The clustering code is very small, and the result is very impressive for such simple code. But the devil is in the details…
Each group of red dots is generated randomly from a given central point (draws N randomly distributed points inside a circle of radius R centred at C). Each centre is randomly placed, and sometimes their groups collide (as you can see in the image), but that’s part of the challenge. To find the groups, and their centres, I throw random points (with no knowledge of the groups’ centres) and iterate until I find all groups.
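Generating one such group could look like the sketch below. This is not the code linked above; it assumes the points are uniformly distributed over the disc (the post only says "randomly distributed"), and `make_group` is a hypothetical name.

```python
import math
import random


def make_group(centre, radius, count, rng=random):
    """Draw `count` points inside a circle of `radius` centred at `centre`.

    Assumption: uniform density over the disc. Taking the square root of
    the random radius fraction avoids crowding points near the centre.
    """
    cx, cy = centre
    points = []
    for _ in range(count):
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points


if __name__ == "__main__":
    group = make_group(centre=(5.0, -2.0), radius=3.0, count=10)
    print(group[:3])
```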
The iteration is very simple, and consists of two steps:
- Assignment Step: For each point, assign it to the nearest mean. This is why you need the concept of distance, and that’s a tricky part. With Cartesian coordinates, it’s simple.
- Update Step: Calculate the real mean of all points belonging to each mean point, and update the point to be at it. This is basically moving the supposed (randomly guessed) mean to its rightful place.
On the second iteration, the means, that were randomly selected at first, are now closer to a set of points. Not necessarily points in the same cluster, but the cluster that has more points assigned to any given mean will slowly steal it from the others, since it’ll have more weight when updating it on step 2.
If all goes right, the means will slowly move towards the centre of each group and you can stop when the means don’t move too much after the update step.
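The two steps and the stopping rule can be sketched as follows. This is a simplified illustration, not the code linked above: the `tolerance` and `max_iterations` parameters are assumptions added so the loop always terminates, and a mean with no points assigned is simply left where it is.

```python
import math


def kmeans(points, means, tolerance=1e-6, max_iterations=100):
    """Plain k-means on 2D points, starting from the guessed `means`.

    Returns the final means and, for each point, the index of its mean.
    """
    means = list(means)
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest mean.
        assignment = [
            min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            for p in points
        ]
        # Update step: move each mean to the real mean of its points.
        new_means = []
        for i, old in enumerate(means):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                mx = sum(x for x, _ in members) / len(members)
                my = sum(y for _, y in members) / len(members)
                new_means.append((mx, my))
            else:
                new_means.append(old)  # assumption: empty means stay put
        # Stop when no mean moved more than the tolerance.
        shift = max(math.dist(o, n) for o, n in zip(means, new_means))
        means = new_means
        if shift < tolerance:
            break
    return means, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    found, labels = kmeans(data, means=[(1.0, 1.0), (9.0, 9.0)])
    print(found, labels)
```

The initial `means` are passed in by the caller, which matches the idea of throwing random points at the data; in practice they could be drawn at random from the bounding box of the points.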
Many problems will arise in this simplified version, for sure. For instance, if a mean is exactly in between two groups, and both pull it towards their centres with equally strong forces, the mean never moves and the algorithm thinks it has already found its group when, in fact, it has found two. Or a group can be so large that it ends up with two or more means to which its points belong, splitting it into many groups.
To overcome these deficiencies, some advanced forms of k-means, sometimes called soft k-means, take into account the shape of the group during the update step. Other heuristics can be added as further steps to make sure there aren’t two means too close to each other (relative to their groups’ sizes), or to detect big gaps between points of the same group, but that kind of heuristic tends to be expensive in execution (quadratic in the group size), since it examines every point of a group in relation to the other points of the same group.
All in all, still an impressive performance for such a simple algorithm. Next in line, I’ll try clustering data distributed among many binary axes and see how k-means behaves.
|
|
Google knows what you searched last summer |
| March 3rd, 2012 under InfoSec, rvincoletto, Web, World. [ Comments: 3 ]
|
|
Despite all the controversy, Google started its new Privacy Policy last Thursday and, whether you like it or not, you are being watched.
Being realistic, this is not far from what they were already doing: Google already tracked your searches, what you are watching on Youtube or your emails.
But before March 1st, Google Plus, Youtube, Gmail and almost 60 Google products were in different databases. With this change, the Google guys are giving themselves the right to put all those products in just one big place, and put one and one and one together to build a better and more complete picture of the online behaviour of YOU. And use it to chase YOU with their ads.
And you can’t opt out. If you want to use any Google product you are under their privacy policy.
It would be nonsense for me to tell you to stop using Google products. Almost everything you do on the internet today, from searches and emails to finding a street and comparing products’ prices, is somehow done through a Google product or related to one.
But you can at least reduce the amount of information that Google will be able to collect from you.
You can, for instance, delete your Google history by going to https://www.google.com/history/ and clicking the button “Remove all Web History”.

You can also configure your advertising settings here: https://www.google.com/settings/u/0/ads/preferences/

You can edit your settings or even opt out.
Another way to “confuse” Google is creating a different account for each Google service (if you can keep up with all usernames and passwords).
Or, when watching a video on Youtube or searching the Web, make sure you are not logged in to your Google account.
There is also the possibility to use browser plugins that work to protect your data, or even anonymous proxies.
But the truth is, as soon as you type into your computer, click anything, visit a page, talk through Skype, or even talk on a telephone (mobile or fixed), those who want to can spy on you.
At least now Google is coming clean and telling you that they are spying on you. It makes better sense to me than living in a fool’s paradise, where you still believe that you have control over your life.
|
|
Hypocrisy in Hollywood |
| March 3rd, 2012 under Articles, Digital Rights, rengolin. [ Comments: none ]
|
|

Paralegal‘s Peter Kim sent me this nice info-graphic about a short history of the media industry in Hollywood, and I thought I would share with you.
I’m not a lawyer, but his site seems to have some good bite-sized information about copyright and other legal terms that we should all know if we are to avoid The Big Brother in our society. Most of it obviously only applies to the US, but as we all know, US law has been extended to the world far too much: British hackers being extradited to the US, European citizens getting harassed by US media companies and Asian companies being shut down by the mighty power of Hollywood.
There are other info-graphics on the site that are worth looking at. Thanks for the tip, Peter.
|
|
Emergent behaviour |
| February 23rd, 2012 under Computers, Distributed, rengolin, Science. [ Comments: 1 ]
|
|
There is a lot of attention to emergent behaviour nowadays (ex. here, here, here and here), but it’s still on the outskirts of science and computing.
Science
For millennia, science has isolated each single behaviour of a system (or system of systems) to study it in detail, then joined them together to grasp the bigger picture. The problem is that this approximation can only be done with simple systems, such as the ones studied by Aristotle, Newton and Ampere. Every time scientists approached the edges of their theories (including those three), they just left the rest as an exercise to the reader.
Newton foresaw relativity and the possible lack of continuity in space and time, but he did nothing to address that. Fair enough, his work was much more important to science than venturing into the unknowns of science; that would have been almost mystical of him to try (although he was an alchemist). But more and more, scientific progress seems to be blocked by chaos theory, where you either unwind the knots or go back to alchemy.
Chaos theory has existed for more than a century, but it was only recently that it has been applied to anything outside differential equations. The hypersensitivity to initial conditions is clear in differential systems, but other systems have a less visible, but no less important, sensitivity. We just don’t see it well enough, since most of the other systems are not as well formulated as differential equations (thanks to Newton and Leibniz).
Neurology and the quest for artificial intelligence have raised a strong interest in chaos theory and fractal systems. The development in neural networks has shown that groups and networks also have a fundamental chaotic nature, but more importantly, that it’s only through the chaotic nature of those systems that you can get a good amount of information from them. Quantum mechanics had the same evolution, with Heisenberg and Schroedinger kicking the ball first on the oddities of the universe and on how important the lack of knowledge of a system is to being able to extract information from it (think of Schroedinger’s cat).
A network with direct and fixed thresholds doesn’t learn. Particles with known positions and velocities don’t compute. N-body systems with definite trajectories don’t exist.
The genetic code has some similarities to these models. Living beings have far more junk than genes in their chromosomes (reaching 98% of junk in the human genome), but changes in the junk parts can often lead to invalid creatures. If junk within genes (introns) gets modified, the actual code (exons) could be split differently, leading to a completely new, dysfunctional protein. Or, if you add start sequences (TATA-boxes) to non-coding regions, some of them will be transcribed into whatever protein they could make, creating rubbish within cells, consuming resources or eventually killing the host.
But most of the non-coding DNA is also highly susceptible to changes, and that’s probably its most important function, adapted to the specific mutation rates of our planet and our defence mechanism against such mutations. For billions of years, the living beings on Earth have adapted that code. Each of us has a super-computer that can choose, by design, the best ratios for a given scenario within a few generations, and create a whole new species or keep the current one adapted, depending on what’s more beneficial.
But not everyone is that patient…
Programming
Sadly, in my profession, chaos plays an important part, too.
As programs grow old, and programmers move on, a good part of the code becomes stale, creating dependencies that are hard to find, harder to fix. In that sense, programs are pretty much like the genetic code, the amount of junk increases over time, and that gives the program resistance against changes. The main problem with computing, that is not clear in genetics, is that the code that stays behind, is normally the code that no one wants to touch, thus, the ugliest and most problematic.
DNA transcriptors don’t care where the genes are, they find a start sequence and go on with their lives. Programmers, we believe, have free will and that gives them the right to choose where to apply a change. They can either work around the problem, making the code even uglier, or they can go on and try to fix the problem in the first place.
Non-programmers would quickly state that only lazy programmers would do the former, but more experienced ones will admit to having done so on numerous occasions for different reasons. Good programmers would do that because fixing the real problem is so painful to so many other systems that it’s best to leave it alone, and replace that part in the future (only they never will). Bad programmers are not just lazy, some of them really believe that’s the right thing to do (I met many like this), and that adds some more chaos into the game.
It’s not uncommon to try to fix a small problem, go more than half-way through and hit a small glitch on a separate system. A glitch that you quickly identify as being wrongly designed, so you, as any good programmer would do, re-design it and implement the new design, which is already much bigger than the fix itself. All tests pass, except the one, that shows you another glitch, raised by your different design. This can go on indefinitely.
Some changes are better done in packs, all together, to make sure all designs are consistent and the program behaves as it should, not necessarily as the tests say it would. But that’s not only too big for one person at one time, it’s practically impossible when other people are changing the program under your feet, releasing customer versions and changing the design themselves. There is a point where a refactoring is not only hard, but also a bad design choice.
And that’s when code becomes introns, and is seldom removed.
Networks
The power of networks is rising, though slower than expected. For decades, people have known about synergy, chaos and emergent behaviour, but it was only recently, with the quality and amount of information on global social interaction, that those topics rose again into the main picture.
Twitter, Facebook and the like have raised so many questions about human behaviour, and a lot of research has been done to address those questions and, to a certain extent, answer them. Psychologists and social scientists have known for centuries that social interaction is greater than the sum of its parts, but now we have the tools and the data to prove it once and for all.
Computing clusters have been applied to most of the hard scientific problems for half a century (weather prediction, earthquake simulation, exhaustion proofs in graph theory). They also took on a commercial side with MapReduce and similar mechanisms that have popularised distributed architectures, but that’s only the beginning.
On distributed systems of today, emergent behaviour is treated as a bug, that has to be eradicated. In the exact science of computing, locks and updates have to occur in the precise order they were programmed to, to yield the exact result one is expecting. No more, no less.
But to keep our system out of emergent behaviours, we invariably introduce emergent behaviour in our code. Multiple checks on locks and variables, different design choices for different components that have to work together and the expectancy of precise results or nothing make the number of lines of code grow exponentially. And, since that has to run fast, even more lines and design choices are added to avoid one extra operation inside a very busy loop.
While all this is justifiable, it’s not sustainable. In the long run (think decades), the code will be replaced or the product will be discontinued, but there is a limit to which a program can receive additional lines without losing some others. And the cost of refactoring increases with the lifetime of a product. This is why old products don’t get too many updates, not because they’re good enough already, but because it’s impossible to add new features without breaking a lot of others.
Distant future
As much as I like emergent behaviour, I can’t begin to fathom how to harness that power. Stochastic computing is one way and has been done with a certain level of success here and here, but it’s far from easy to create a general logic behind it.
Unlike Turing machines, emergent behaviour comes from multiple sources, dressed in multiple disguises and producing far too much variety in results than can be accounted for in one theory. It’s similar to string theory, where there are several variations of it, but only one M theory, the one that joins them all together. The problem is, nobody knows what this M theory looks like. Well, they barely know what the different versions of string theory look like, anyway.
In that sense, emergent theory is even further from being understood in its entirety than string theory. But I strongly believe that this is one way out of the conundrum we live in today, where adding more features makes it harder to add more features (like mass at relativistic speeds).
With stochastic computing there is no need for locks, since all that matters is the probability of an outcome, and precise values do not make sense. There is also no need for NxM combinations of modules and special checks, since the power is not in the computations themselves, but in the meta-computation, done by the state of the network, rather than its specific components.
But that, I’m afraid, I won’t see in my lifetime.
|