Why is creating a list comprehension faster?

Yesterday, I was removing a list comprehension passed as an argument to sum(). This is the diff:

- sum([p.price for p in self.products])
+ sum(p.price for p in self.products)

To show the original programmer my line of thought, I performed a little experiment with the following message and code:

List comprehensions as arguments to reduce-like functions are usually less efficient than using the generator expression itself. The reason is that Python creates the list just to discard it afterwards:

>>> def fib(n):
...     a, b = 1, 1
...     while a < n:
...         yield a
...         a, b = b, a + b
...

>>> %timeit sum([x for x in fib(100)])
2.61 µs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit sum(x for x in fib(100))
3.13 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I was surprised to see that the list comprehension was a little bit faster than the generator expression. So it seems that, for short enough inputs, building the list first and then summing it is quite fast.

I had to change the implementation of fib to control the number of items in order to make my point:

>>> def fib(n):
...     a, b = 1, 1
...     for _ in range(n):
...         yield a
...         a, b = b, a + b

And still with 100 items the differences are small, though this time passing the generator directly wins:

>>> %timeit sum(fib(100))
14.2 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit sum(list(fib(100)))
16.4 µs ± 247 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit sum([x for x in fib(100)])
18 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Of course, the differences are too small to draw final conclusions. And, of course, as the input grows larger, the list-comprehension version becomes slower:

>>> %timeit sum([x for x in fib(10**5)])
497 ms ± 84.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit sum(x for x in fib(10**5))
329 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So I think I still prefer the version using the generator expression, just to cover myself for the case where there are too many items to be held in memory.
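
To make the memory argument concrete, here is a quick sketch of how one could check it with tracemalloc (tracemalloc.reset_peak() needs Python 3.9+; the exact numbers will vary, but the list peak should dwarf the generator's):

import tracemalloc

tracemalloc.start()
sum([x * x for x in range(10**6)])       # materializes a million-item list
peak_list = tracemalloc.get_traced_memory()[1]
tracemalloc.reset_peak()                 # requires Python 3.9+
sum(x * x for x in range(10**6))         # keeps only one item alive at a time
peak_gen = tracemalloc.get_traced_memory()[1]
print(peak_list, peak_gen)               # expect the list peak to be much bigger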

(I didn’t push the commit, though.)

Composing iterator-returning functions

A few days ago I was reviewing a piece of Python code. I was looking for a bug, but in the process I found a very interesting function.

The system allows the user to “extend” the structure of Products by providing more attributes which can be used later on when creating Sale (or Purchase) Orders. The Product object is like a class defining the structure, and the items in Orders are the instances of such classes.

When trying to export a full catalog, each attribute implies a new row in the spreadsheet file. To avoid too much coupling, this process was modeled as a kind of seeded generation of every possible row. The algorithm started with a seed instance of a product without any attribute, and then it generated every possible attribute-complete instance by composing several functions that took an instance and returned an iterator of instances. Each function deals with a specific type of attribute, and simply copies those attributes into the instances being generated.

The function doing the generation of all possible instances was more or less like this:

def iter_product(initial, *funcs):
    if not funcs:
        yield initial
    else:
        f, *fs = funcs
        for x in f(initial):
            yield from iter_product(x, *fs)
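
To see it in action, here is a hypothetical run with two made-up attribute functions (the names and attributes are mine, not the reviewed code's):

def with_colors(instance):
    return [dict(instance, color=c) for c in ('red', 'blue')]

def with_sizes(instance):
    return [dict(instance, size=s) for s in ('S', 'M')]

>>> list(iter_product({'name': 'shirt'}, with_colors, with_sizes))
[{'name': 'shirt', 'color': 'red', 'size': 'S'},
 {'name': 'shirt', 'color': 'red', 'size': 'M'},
 {'name': 'shirt', 'color': 'blue', 'size': 'S'},
 {'name': 'shirt', 'color': 'blue', 'size': 'M'}]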

That definition was OK, but I wondered if I could build just the composition of several iterator-returning functions, so that I could reuse it with several initial objects.

A little incursion into Haskell

In order to test my Haskell, I first did a Haskell version. I started by trying to create a composition operator much like the (.) operator, which has type:

(.) :: (b -> c) -> (a -> b) -> a -> c

The type of our composition of iterator-returning functions would be:

infixr 9 >>.
(>>.) :: (b -> [c]) -> (a -> [b]) -> a -> [c]

The choice of (>>.) as the operator will become (I hope) evident. The most straightforward and easiest-to-understand implementation uses the list-comprehension syntax:

g >>. f = \x -> [z| y <- f x, z <- g y]

Can we generalize this? Yes! The list is an instance of Monad, defined as:

instance Monad [] where
    return x = [x]
    m >>= f  = concat (map f m)

And list comprehensions can be easily rewritten using the do notation:

(>>.) :: Monad m => (b -> m c) -> (a -> m b) -> a -> m c
g >>. f = \x -> do{y <- f x; z <- g y; return z}

The monadic >>= operator is usually called bind. Its type is

Monad m => m a -> (a -> m b) -> m b

So I think there’s a more compact way to write our >>. operator. And now you may start to see why I chose >>..

The do notation is just syntactic sugar over using >>= (or its brother >>). The rules are given here. So let’s transform our implementation. We start with our current definition:

\x -> do {y <- f x; z <- g y; return z}

And rewrite the do two times until there are no more:

\x -> let s1 y = do {z <- g y; return z} in f x >>= s1

\x -> let s1 y = (let s2 z = return z in g y >>= s2) in f x >>= s1

Now, we can recall the eta-conversion rule and see that s2 = return, so:

\x -> let s1 y = (g y >>= return) in f x >>= s1

Now we can use the monadic “law” that states that m >>= return must be equivalent to m:

\x -> let s1 y = g y in f x >>= s1

And, once more, eta-conversion helps us to remove the let, because s1 = g. Thus we get:

(>>.)  :: Monad m => (b -> m c) -> (a -> m b) -> a -> m c
g >>. f = \x -> f x >>= g

This is as good as I was able to get. Since we’re using >>=, I think this is the best we can get (i.e. we can’t generalize to Applicative).

Chaining several iterator-returning functions

Now, I can define a chain function. It takes a list of several a -> m a functions and composes them together (from right to left, as expected):

chain :: Monad m => [a -> m a] -> a -> m a

My first attempt was:

chain :: Monad m => [a -> m a] -> a -> m a
chain []  = return
chain (f:fs) = f >>. (chain fs)

But then I realized that’s a fold:

chain :: (Foldable l, Monad m) => l (a -> m a) -> a -> m a
chain = foldr (>>.) return

And that completes our incursion into Haskell.

Doing the same in Python

Going from this Haskell definition of chain to Python is quite easy. But we’re not going to work with any possible monad, just lists (iterators, actually).

from functools import reduce

def iter_compose(*fs):
    if len(fs) == 2:
        # optimize the 'lambda x: [x]' for the *usual* case of 2-args.
        return _compose(*fs)
    else:
        return reduce(_compose, fs, lambda x: [x])

def _compose(g, f):
    return lambda x: (z for y in f(x) for z in g(y))
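
For instance, a contrived example just to show the composition order (the functions are made up):

>>> fork = lambda x: [x, x + 1]
>>> tens = lambda x: [10 * x]
>>> composed = iter_compose(tens, fork)   # fork runs first, as in (tens >>. fork)
>>> list(composed(3))
[30, 40]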

We have included iter_compose() in xoutil 1.9.6 and 2.0.6.

A failing fail in the list monad

I’m following the Real World Haskell book. In the chapter about monads, they create a simple example using the list monad to compute all pairs of numbers (x, y) such that x * y == n for a given n.

Their solution is:

multiplyTo :: Int -> [(Int, Int)]
multiplyTo n = do
      x <- [1..n]
      y <- [x..n]
      guarded (x * y == n) $
        return (x, y)

guarded :: Bool -> [a] -> [a]
guarded True xs = xs
guarded False _ = []

But I was wondering if I could restate guarded for any monad. Since fail in the list monad is fail _ = [], I thought I could do:

guarded :: Monad m => Bool -> m a -> m a
guarded True = id
guarded False = fail "skipped"

However this actually fails in ghci:

*Main> multiplyTo 24
*** Exception: skipped

I slept on it, and this morning I figured out that there’s actually a small but important difference between guarded False _ = [] and guarded False = fail "..."

The type of guarded False is Monad m => m a -> m a. However, the type of fail "skipped" is just Monad m => m a.

Why does guarded False = fail "skipped" type-check at all? That’s an open question (as of the time of writing); my best guess is that functions are themselves a Monad, so fail "skipped" can be given the type m a -> m a via the ((->) (m a)) instance, whose default fail is error, which would also explain the exception above.

If I define guarded as:

guarded :: Monad m => Bool -> m a -> m a
guarded True xs = xs
guarded False _ = fail "skipped"

Or:

guarded :: Monad m => Bool -> m a -> m a
guarded True = id
guarded False = \s -> fail "skipped"

Both work correctly.

Warning

Using fail in real-world Haskell code is strongly discouraged.

Dissecting foldl in terms of foldr

The book ‘Real World Haskell’ provides an implementation of foldl in terms of foldr:

myFoldl :: (b -> a -> b) -> b -> [a] -> b
myFoldl f z xs = foldr step id xs z
   where step x g a = g(f a x)

They ask the reader (me) to try to understand the implementation. They warn it’s not trivial, but I found it interesting. So, I just want to share my line of thought.

The first thing I noticed is that z is not an argument to foldr; you can rewrite the first line as:

(foldr step id xs) z

This makes visible that the result of the foldr is a function that takes z as an argument. By looking at the case foldr step id [], which equates to id, we can see that the type of foldr’s result is b -> b.

This also implies that the type of step must be a -> (b -> b) -> (b -> b). You can prove that by extracting step:

step1 :: (b -> a -> b) -> a -> (b -> b) -> (b -> b)
step1 f x g a = g(f a x)

myFoldl1 :: (b -> a -> b) -> b -> [a] -> b
myFoldl1 f z xs = foldr (step1 f) id xs z

Use ghci to verify that.

Now comes the funny part of the implementation: step is defined with three arguments

where step x g a = g(f a x)

but foldr only passes the first two, and that’s totally fine; the result will be another function.

You can make this explicit:

myFoldl2 :: (b -> a -> b) -> b -> [a] -> b
myFoldl2 f z xs = foldr step id xs z
   where step x g = \a -> g(f a x)

That’s a beautiful thing about Haskell: both definitions of step are actually indistinguishable:

step x g a = g(f a x)

step x g = \a -> g(f a x)

The first one has an equation-like look that I find very elegant in a programming language.

Having all that, we can easily follow how the expression myFoldl (+) 0 [1, 2] proceeds:

myFoldl (+) 0 [1, 2]
(foldr step id [1, 2]) 0                  -- by def. of myFoldl
(step 1 (foldr step id [2])) 0            -- by def. of foldr
(step 1 (step 2 (foldr step id []))) 0    -- by def. of foldr
(step 1 (step 2 id)) 0                    -- by def. of foldr for []
(step 1 (\a -> id(a + 2))) 0              -- applying `step 2 id`

(step 1 (\a -> a + 2)) 0                  -- applying id (Haskell might not
                                          --   do this right now)

(\x -> (\a -> a + 2)(x + 1)) 0            -- applying step
(\x -> (x + 1) + 2) 0                     -- applying the inner lambda
(0 + 1) + 2                               -- applying the outer lambda

And that’s (more or less) how myFoldl works.
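
For the Python-inclined, here is a rough translation of the same trick (a sketch of mine, not from the book): reduce over the reversed list plays the role of foldr, and the fold builds a function that is finally applied to z.

from functools import reduce

def my_foldl(f, z, xs):
    # step x g = \a -> g(f a x), as in the Haskell version
    def step(x, g):
        return lambda a: g(f(a, x))
    # foldr step id xs, written as a reduce over the reversed list
    go = reduce(lambda g, x: step(x, g), reversed(xs), lambda a: a)
    return go(z)   # my_foldl(lambda a, x: a + x, 0, [1, 2]) returns 3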

Non-curated notes about Cycle.js

In the past few days I have been reading about Reactive Programming, mostly about how it’s done with Cycle.js. As the title attempts to suggest, this post is by no means an account of well-thought-out ideas, but my first impressions and notes while reading about it.

Cycle.js is attractive, and its appeal springs, in my opinion, from two key properties:

  • There’s a single type of connector between the components: the streams.

    Note

    Both sources and sinks return streams.

  • There are two distinct types of computational components: the drivers and the main(sources) function.

    Note

    Intent, view and model functions may be regarded as internal structure of main().

This is appealing because the architecture is simple. You may understand the main points of this architecture in about an hour. You may recall this is actually an evolution (or instance) of the producer-consumer pattern.

There’s a third element in Cycle.js: the cycle itself, which is based on the ‘dialog abstraction’. Actually, this is what caught my attention in the first place. It goes like this: your program’s outputs are the input to an external (possibly human) entity which, in turn, may react and produce more events in your program’s source streams.

Architecturally this is simple and good.

Components

Any Cycle.js app can be reused as a component in a larger Cycle.js app.

—From the Cycle.js documentation

When it comes to designing an application or component you have to decide about the type of events your application may receive and the expected output events it may produce in response.

The previous statement is only true in two cases:

  • Your ‘smaller’ Cycle.js application does not need to interact with other parts of your ‘larger’ application.
  • Your ‘smaller’ Cycle.js application exposes its model; or you have “model drivers”.

I think this becomes rather obvious in the 16th part of the Egghead video series on Cycle.js: the one where they make the slider component and need to expose the stream of ‘values’ from the slider.

Note

Actually, the documentation states that components expose, somehow, their state.

Streams to the extreme (and localization)

The Egghead series shows how Cycle.js came to be, so you can see it as the evolution it is. It’s a short video series worth watching: right to the point, without many detours.

There’s a video where they need to make a component out of the Height and Weight sliders and to pass parameters like the label, the min value and the max value. I thought they were going to use plain old closures:

function Slider(sources, props) {
    const label = props.label;
    // etc...
    return main(sources);
}

It came as a surprise that they put those parameters into a props source to the main() function of the component.

At first I thought that was really far-fetched, but after thinking about it a bit more I think it can be really useful.

I started to think about an application with just a slider and a language selector:

Height: 6 feet   <-----o-->

Choose language: [ENG]

The slider component is the one we see in the video series. The language selector is a selection component and the expected behavior is that when I change the language the entire application changes to the new language.

My first thought is that the language selector gets its value from a driver (which I don’t know if it already exists) that deals with localization. Let’s say you can obtain such a driver like this:

cycle.run(main, {
   // etc...
   locale: makeLocaleDriver({languages: ['en_US', 'es_ES'], ...})
})

The driver would let you respond to changes in translations:

const label$ = sources.locale.select('Height');

Or, combining it with a mapping over another stream, so that the props stream remains almost unchanged:

const label$ = props$.map(p => sources.locale.select(p.label));

However, after revisiting that last idea, I noticed that it doesn’t work. A change in the locale does not trigger any event in the props$ stream. Assuming that locale.current$ is a stream of localization objects, this may work:

const label$ = props$.combine(sources.locale.current$)
                     .map((label, locale) => locale.gettext(label));

The thing gets a little bit trickier when it comes to changing the units: feet vs meters, etc… I’ve been thinking about it for a bit. The most problematic issue is that state is not clearly owned unless we introduce a kind of quantity for which the unit of measure is explicit:

run(main, {
    props: xs.of({
       value: new Quantity(175, Unit.Length.cm),
    })
})

However, this may seem a bit overreaching for a single value that only needs to stay between two boundaries (a slider).

This is, IMO, the breaking point: if I really need to manage units in my application and those need to be fully localized, my components might be regarded as over-engineered for other apps. My only hope is that a simple slider, without any knowledge of units, might be wrapped inside a FullyLocalizedSlider for that purpose.

Open questions

Most of the ideas exposed above are not battle-tested. I happen to be evaluating whether I could use Cycle.js inside Odoo to develop some widgets that require near real-time updates, for which the stream interface is quite natural.

There are challenges in integrating my components with the rest of the application and, it being an application that must display at least three languages, I need to think in advance about the problems I will face.

The bus implementation in Odoo and notification systems in the Web

I commented on my post about our migration to Odoo that our users open many tabs to the same Odoo instance. When the chat is active this means an open connection for each tab and, since a chat is interactive in nature, all but one of these connections are not actually required [1].

A while ago I reported that switching from a thread-based deployment to a pre-forked one with a gevent-based process for long-polling connections actually spiked the concurrency issues we were witnessing in our DB. Our DB complained about being unable to serialize concurrent transactions. By inspecting the logs we noticed that this almost always had to do with an update to the table im_chat_presence, which holds the status of users in the chat.

I thought the forked model would diminish the chance of collisions because (I thought) Odoo would use a single DB connection per worker process and another one for the gevent worker that would simply serialize all the chat related queries.

This spike was against all rationale I could think of, but honestly I didn’t check my assumptions by reading the code. They might just be the wrong assumptions in the first place, and I needed to really read the code to see what was really happening.

After restoring the threaded deployment the concurrency issues dropped again. They didn’t return to the levels they were at before but, at the same time, we were introducing other modules (and the related staff), so there were more open tabs… Afterwards we went back to the process model, and the issues spiked again, though in smaller numbers.

Several weeks later I deployed a patch to the Odoo bus implementation that simply reduced the frequency of polling when the page is hidden. The concurrency issues were now mostly gone: from about 3k a day to no more than 50.

We still had unanswered questions regarding DB connection management in Odoo. There were suspicious signs. For instance, reducing the db_maxconn option to 8 or less made Odoo complain very often about the connection pool being full. We already knew that:

  • For each cron worker at least 2 connections are used: one holds the cron job locked while the other is used by the job itself.
  • The bus (used for the chat) also uses another connection to the DB “postgres”: Notifications and wake-ups of bus-related events use this connection.

The most suspicious sign, however, was that even with a pool size of 32 this issue came up from time to time.

Time went by and the issue remained at an acceptable (for us) level of a few dozen per day. You may say that’s too much but, bear with me, it’s reasonable for our hardware and user base.

Just a few days ago, we realized that issues coming from our gevent-based process were not being properly reported to our Sentry. We fixed that and observed that we’re getting about 2.5k DB collision reports per day again. And once more the table where most collisions happen is im_chat_presence.

As a workaround we simply reduced the update rate for that table. But a better solution needed to be found.

Cross-tab polling

By looking more closely at our users’ usage patterns, we discovered that users switch from tab to tab regularly, so the polling remains high. So we went down the path of figuring out a way to have a single polling connection even when many tabs are open.

I remembered the Local Storage API and sat down to write a simple proof of concept for listening to localStorage events from different tabs. The proof of concept yielded good results, so I turned to Google, searched for “cross-tab communication” and found the tabex library.

After that, replacing the implementation of the bus with a new one based on an exchange between tabs was not very hard. These are the commits done in our branch (in reverse topological order, as in git log):

0555434 * bus: Clarify the channel delay in polls.
d10a1b5 * bus: Checks for the right state.
af4a6dc * bus: Add language tag in the documentation.
636490b * im_chat: Revert "Reduce the presure on im_chat_presence table."
eda7638 * bus: Don't notify if not in polling state.
4d09ac6 * web_celery: Use a dedicated bus exchange for jobs.
84239e3 * im_chat: Beep only in the master tab.
2cb27da * Improve the bus to allow tab synchronization.

The file addons/bus/static/src/js/bus.rst explains the new design.

A few weeks ago I did the first merge of our 8.0 changes into the 9.0 branch. In the process I realized that Odoo 9.0’s bus was improved and now contains a cross-tab communication feature.

So I’ll probably end up backporting their implementation into 8.0.

Notes

[1]Of course an open connection allows the web app to change the window’s title when the user has an unread message, but in our case this notification is also unneeded, because the message is actually being read in another tab.

The productivity of tomorrow trumps that of today

That’s probably harsh, but I think it is absolutely right. Writing crappy software today to be more productive today will make you less productive tomorrow. It’s that simple. And it’s cumulative, too: if you neglect your future productivity, it will slowly diminish until you reach a point of competitive disadvantage where you’ll find yourself struggling to keep up instead of innovating.

And it’s so damn exhausting to explain why…

Software complexity does not come from the tools, but from the mental framework required (and imposed at times) to understand it. So don’t ever think measuring Cyclomatic Complexity (CC) and other similar metrics will yield something close to the true measure of the quality of your code.

There are only two hard things in Computer Science: cache invalidation and naming things.

—Phil Karlton

def _check(l):
    if len(l) <= 1:
        return l
    l1, l2 = [], []
    m = l[0]
    for x in l[1:]:
        if x <= m:
            l1.append(x)
        else:
            l2.append(x)
    return _check(l1) + [m] + _check(l2)

This code has a CC of 4, which is very nice; yet it will take you at least a minute to figure out what it does. If only I had chosen to name the function quicksort…

>>> _check([84, 95, 89, 4, 77, 24, 95, 86, 70, 16])
[4, 16, 24, 70, 77, 84, 86, 89, 95, 95]

A quiz to answer in less than 5 seconds: what does the following line of code mean in OpenERP’s ORM?

group.write({'implied_ids': [(3, implied_group.id)]})

This line of code has a CC of 1: as good as it gets, isn’t it? But it’s so darn difficult to read, unless you have your brain wired up to see “forget the link between this group and the implied_group”… To be fair, there is someone out there in the OpenERP universe who cared a bit about this:

# file addons/base/ir/ir_fields.py

CREATE = lambda values: (0, False, values)
UPDATE = lambda id, values: (1, id, values)
DELETE = lambda id: (2, id, False)
FORGET = lambda id: (3, id, False)
LINK_TO = lambda id: (4, id, False)
DELETE_ALL = lambda: (5, False, False)
REPLACE_WITH = lambda ids: (6, False, ids)

But no one else is using it!
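
With those helpers, the quiz line could read like this (a sketch; whether the helpers can be imported from that module inside your addon is an assumption of mine):

from openerp.addons.base.ir.ir_fields import FORGET  # hypothetical import path

# reads as: forget the link between this group and the implied group
group.write({'implied_ids': [FORGET(implied_group.id)]})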

And, yes, it’s exhausting going over this.

Odoo at last!

After several months, we have finally moved from OpenERP to Odoo. It took several rounds of testing, finding bugs and incompatibilities in our own modules and fixing them.

A summary of the process

The OpenUpgrade project proved to be very complete for us; it had almost everything we needed:

  • The Accounting and Project modules were done.
  • The CRM module was not yet integrated into the mainstream repository but a fork by Stefan Rijnhart had a branch that proved very complete for our needs.

We began by merging Stefan’s branch into our own clone of the repository. We ran into a small issue and fixed it. Afterwards, some of our recurring events were missing the timezone and the migration failed. For that we created the branch 8.0-valid-rrules, and it did the job after a couple of tweaks.

The branch we actually used for the migration is merchise-8.0. It includes several commits we had to make for our custom addons to work. Those commits are probably too local to be merged back upstream. However, we have published them should anyone find them useful.

The menu decomposition

The biggest issue we’ve encountered so far is that, after several hours of operation, all the menus stopped working: you clicked the “Leads” menu item and got, say, the list of Many2Many fields. (???!!!)

A first inspection showed that all menu items in the HTML were somehow related to the same action id. Quickly, we truncated the ir_ui_menu table and its associated ir_model_data rows, upgraded the DB and all menus came to life once more.

The day after, the same problem happened. Deeper inspection of both the code and the DB showed the problem lay in the ir_values table. That table relates menu items with actions. The rows containing this relation showed nonsense: almost all the menu items were related to “res_partner,20”. And in the UI this was interpreted as pointing to the action with id 20.

Comparing the table before and after the upgrade showed no problem there. So this issue had appeared after the migration itself.

Our previous method of truncating ir_ui_menu didn’t fix this issue because the ir_values rows remained there, breaking the menus over and over again. Besides, truncating ir_ui_menu had the unintentional, bad side effect of removing every mailing list in our DB.

We’re still trying to track down how the ir.values got messed up in the first place.

The magical 6 tabs

Our users started to report slowness after a couple of hours of work. First, we noticed that the call to load_needaction was taking too long and found out this call asked for a lot of data it did not need; we provided a fix and some loose benchmarks. Though this patch needs to be completed, the slowness was solved… for a moment.

Some users kept reporting that the Odoo interface was freezing all the time, even inviting them to go out for a cup of coffee… But other users were happy after the patch. Closer inspection showed many of the AJAX calls were blocked waiting for a socket to become available.

It turned out these users had developed a habit of opening several tabs when they worked with OpenERP. They kept this behavior, but a couple of facts worked against them:

  • We have installed the Odoo chat. Many of our users are geographically spread out, and the chat provided an instant and cheaper communication path.

    This feature was, by far, one of the main reasons we moved to Odoo in the first place.

    On the technical side, the chat works by long-polling the server. On our server this connection stays open for about 50s before being dropped (only to be reestablished a moment later).

    This means a tab always has at least one connection open to the server.

  • As explained by Ilya Grigorik in High Performance Networking in Chrome, browsers make only so many connections to the same combination of scheme, host and port.

    In Chrome this number is 6. In Firefox this can be tuned up or down by tweaking the network.http.max-persistent-connections-per-server parameter, but the default is 6.

Combining these, it means that after opening about 6 tabs every AJAX call has to wait up to 50s to be able to even reach the server.

We opted to raise this number in Firefox and to warn users about it. Our user base is small enough and the server has yet to experience any noticeable load.

Epilogue

That’s how our migration-to-Odoo story has gone so far. A little stressful at times, especially when the menus went south for the second time, but I think we have crossed the yellow line now. Should anything else come up, I’ll report it.

I’d love to hear (or read) about others, so I can learn from their experience.

Memento: microservices

Lately there’s been a rise in the debate about microservices architectures. Reading about it, I realized I once architected an application, called Pyxel, based on the same principles, though with differences.

Pyxel reached only its first prototype and was never actually deployed. I’d say it wasn’t given the chance to prove itself. Still, I argue it was technically sound and, although only a prototype and thus needing improvements, it would have met its stated requirements [1].

I haven’t been given permission to freely distribute the source code, but the architecture was presented at several public events, so I can freely describe it here.

Pyxel

Pyxel’s goal was simple:

Be a shared repository of photographs for the news media that wish to collaborate.

Pyxel required that images be tagged with IPTC metadata before uploading. Upon upload, the image went through several processes to extract the metadata and to produce web-friendly versions [2] at several standard dimensions before being published.

On the technical side we also had some very hard constraints:

  • the system would have to be highly available.
  • it should run on spare machines; introducing a new machine to replace an old one should be possible with minimal interruption.

These constraints were the main driver that led us to design what we called a distributed system. Each major component was designed to run in isolation from the rest.

Pyxel was split into several processes:

  • An image queue holding new images. This was a kind of FIFO queue over the network.
  • A catalog holding indexes for retrieving images based on queries. Queries were defined with a syntax allowing anything from very simple queries aided by pattern recognition (PR) of names and dates, to very specific queries that leave the PR mostly unused.
  • A “feature” index. This basically allowed finding similar images. There were two indexes, one based on Hu moments and the other based on wavelets.
  • Several image-processing nodes that take images from the queue and:
    • Extract IPTC metadata and put it in the catalog
    • Extract image features to allow detection of similar images.
    • Produce “web versions” for the image. We always kept the original file, but to keep the design of the prototype simple there was an “as-is web version” that simply returned the same file as the input.
  • A web application that communicates with the index and the catalogs.
  • An FTP server that injects images into the queue using a FUSE mount point (drag-and-drop to the browser wasn’t common yet).

Those were the main components, and all of them ran independently. Of course, for some services to perform at 100% they required others to be working. The web application would not be able to perform a search if the catalog was down, or to upload new images if the queue was out of reach.

The principles stand

Microservices are “a take” on the same old issue of software composition. Conceptually, Parnas still rules: you must do modules. Names have changed, and several rules for building better modules (components, services, etc.) have been provided.

Exposing the modules as services has also been promoted.

On the web, the modules have drifted away from the server into the client.

Microservices and Pyxel

Pyxel was not actually a microservices architecture as it is understood now. Deep down it used a central name server that matched service names to network endpoints. We used Pyro for this prototype.

We did recognize that this was an accidental choice, a matter of convenience to have the prototype working as soon as possible.

We chose Pyro to test our prototype, not to keep it forever if it performed poorly. And though our first tests were positive, there was some slowness.

To accommodate a possible replacement of Pyro, we chose to provide our own “idiom” (in Python) to connect and communicate with services:

with service('pyxel.queue') as queue:
    queue.method(*arguments)
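
A minimal sketch of how such an idiom might be implemented with a context manager (my sketch, not Pyxel’s actual code; the in-process registry below stands in for the Pyro name server we actually used):

from contextlib import contextmanager

_REGISTRY = {}   # stand-in for the central name server

def register(name, endpoint):
    _REGISTRY[name] = endpoint

@contextmanager
def service(name):
    proxy = _REGISTRY[name]   # in Pyxel, a Pyro proxy was looked up here
    try:
        yield proxy
    finally:
        # a real proxy would release its network connection here
        pass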

That pattern was designed to have an appearance as simple as RPC. Later we found that RPC was under fire, but the project was being shut down, so we didn’t fix that.

Under current views Pyxel needs several changes. I can think of a couple:

  • Probably exposing the queue to the client side of the web application instead of sending the images first to the server side of the web application.
  • Extract the user registration, authentication and authorization into a service.

Notes

[1]Actually, one of the main causes of Pyxel’s failure was that I could not make all participants agree on the requirements.
[2]Today’s retina displays would make those thumbnails look like icons, and bandwidth is now also more likely to allow very big pictures to remain friendly.

Hot-swapping Python modules. An experiment.

A new project I’m involved with will probably require dozens of servers running several thousand greenlets each. Top-level greenlets represent jobs, and their children will be the individual tasks those greenlets coordinate/supervise.

This model, however prototypical, resembles that of the OTP in Erlang. A greenlet may be either a supervisor or a worker.

But there’s something missing in our platform that Erlang does have, and it might yield huge benefits: you can change your Erlang code while the program is running.

Modulets. The idea

I asked myself if I could devise an import mechanism that would allow updating a Python module in such a way that already-running greenlets stay unaffected while newly created ones use the new code.

To exemplify, let’s say a typical task is:

def receive_confirmation(message, who):
    from jobs.util.email import send_email
    from jobs.util.email import wait_reply
    # Assume both send_email and wait_reply switch away from the current
    # greenlet and only switch back after they are done.
    message = send_email(message, who)
    res = wait_reply(message)
    return res  # this will be sent to the parent greenlet

The servers start and hundreds of these tasks are launched in different jobs. Many of them sit idle waiting for their replies. Users are happily getting their confirmation emails and replying to them (or ignoring them).

However, we start receiving a lot of bounces in the postmaster inbox. Some users have entered a wrong email address. A change is in order.

In response, we change our implementation of send_email so that it does VERP to know which recipients’ addresses are bouncing, and we also create a new job that involves receiving a confirmation email when a new user registers.

We’d love to simply update our jobs.util package and be done with it like this:

$ source server-virtual-env/bin/activate
$ pip install -U jobs.util -i https://my.private.server/

New jobs will pick up the new version, and the older jobs will keep working as if nothing had changed.

That would be really nice. Such a hot-swap of Python modules per job is what I’m calling a “modulet”.

Currently I have a “working” yet very experimental and under-tested implementation of such a mechanism in our experimental modulets branch of xoutil.
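
To give a flavor of the core idea, here is a minimal sketch of mine (not the actual xoutil code; it assumes Python 3.5+ for importlib.util and the greenlet package): load a private copy of a module keyed by the current greenlet, so that later updates on disk don’t affect copies already loaded.

import sys
import importlib.util
import greenlet

def isolated_import(name):
    """Import a private copy of `name` for the current greenlet."""
    owner = greenlet.getcurrent()
    key = '%s@%s' % (name, id(owner))     # per-greenlet namespace
    if key not in sys.modules:
        spec = importlib.util.find_spec(name)
        module = importlib.util.module_from_spec(spec)
        sys.modules[key] = module         # register before executing the body
        spec.loader.exec_module(module)   # re-runs the module's current source
    return sys.modules[key]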

Modulets in xoutil

The current implementation is a very early proof of concept and not something you’d like to put outside your playground.

The file test_modulet.py is a simple script you may run to see it working. It simply creates a temporary module magic_module that has the get_magic function. This function returns a single value.

The test launches three greenlets that simply call the get_magic function and assert it returns the “right” magic value. Between launches, the module gets updated to return a different magic value, which is passed as an argument to the newly created greenlet.

A single run will print something like:

$ python test_modulet.py
Beginning with 3 in /tmp/tmp1d4rK5
Isolating 'magic_module' as '<greenlet.greenlet object at 0x7f21f4e8daf0>.magic_module'
Isolating 'magic_module' as '<greenlet.greenlet object at 0x7f21f4e8da50>.magic_module'
Isolating 'magic_module' as '<greenlet.greenlet object at 0x7f21f4daa7d0>.magic_module'
Passed. I have the right magic number 1002
Passed. I have the right magic number 1001
Passed. I have the right magic number 1000

If you comment out the bootstrapping of modulets, you’ll get something like:

$ python test_modulet.py
Beginning with 3 in /tmp/tmpeI1oYA
Wrong magic number
Traceback (most recent call last):
  File "test_modulet.py", line 49, in rootprog
    g.switch()
  File "test_modulet.py", line 31, in prog
    assert res == magic, "Expected %d but got %d." % (magic, res)
AssertionError: Expected 1002 but got 1000.
Wrong magic number
Traceback (most recent call last):
  File "test_modulet.py", line 49, in rootprog
    g.switch()
  File "test_modulet.py", line 31, in prog
    assert res == magic, "Expected %d but got %d." % (magic, res)
AssertionError: Expected 1001 but got 1000.
Passed. I have the right magic number 1000

Future work

Since we are at the very early stages of this project, it’s not easy to predict whether we’ll keep modulets in our platform. Probably a Celery-based solution will be enough.

If we were to keep it, there are several things to improve:

  • The current mechanism pollutes sys.modules with a copy of a module per top-level greenlet.

    In the current state, this is an ever-growing pile of modules; copies that are no longer used are never erased.

    This needs to be changed in several ways:

    The namespace we use to masquerade the modules need not be (and should not be) the repr of the greenlet object.

    For the purposes of isolating different versions of the same code we can either use the timestamp of the files, the version of the distribution, etc…

    Running a diesel server will quickly eat all your RAM unless this is changed.

    When a greenlet dies, the only one informed is its parent. But we certainly don’t want jobs to mess with sys.modules to clean up our own mess.

    This poses a challenge of its own and may be delegated outside xoutil itself.

    That being said, it’s likely that the calculation of the current namespace and how to dispose of unused modules will be extension points of modulets.

  • Currently we have a black-list of modules that will never be isolated.

    Changes in those modules will require a restart to be noticed. Those modules are platform-level. They include xoutil itself, greenlet and the entire standard library (which is not expected to change unless you change Python).

    We can also allow white-listing. Both ways are on the table.

    The white-list imposes a more explicit architecture on your platform, since it requires a thorough review of which modules you’re willing to update on the run.

    Access to both lists will be a public API of the Modulet Manager. I can envision a remote-control console you’d use to include a new module in the white-list. But that will be an application of the modulet API, included in the box.