Tech Articles
Notes on tech articles I'm reading.
The mythical 10x programmer (read: 10/10/2021)
Caches, Modes, and Unstable Systems (read: 10/2/2021)
Very cool.
"Most real systems like this have a congestive collapse mode, where they can't get rid of requests as fast as they arrive, concurrency builds up, and the goodput drops, making the issue worse. You can use tools like Little's law to think about those situations."
"So our system has two stable loops. One's a happy loop where the cache is full:"... "The other is a sad loop, where the cache is empty, and stays empty:".
"Load testing typically isn't enough to kick a system in the good loop into the bad loop, and so may not show that the bad loop exists. This is for a couple of reasons. One is that caches love load, and typically behave better under high, predictable, well-behaved load than under normal circumstances. The other is that load tests typically test lots of load, instead of testing the bad pattern for caches, which is load with a different (and heavier-tailed) key frequency distribution from the typical one."
How Bash completion works (read: 9/26/2021)
Interface: a shell function that "accepts" its input through variables like COMP_WORDS and COMP_CWORD, and "returns" completions by filling another one (COMPREPLY).

This was good to know!
Metastability and Distributed Systems (read: 9/11/2021)
"There's no more time-honored way to get things working again, from toasters to global-scale distributed systems, than turning them off and on again."
"Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed."
"We consider the root cause of a metastable failure to be the sustaining feedback loop, rather than the trigger. There are many triggers that can lead to the same failure state, so addressing the sustaining effect is much more likely to prevent future outages."
Retries: "If you're only looking at your day-to-day error rate metric, you can be lead to believe that adding more retries makes systems better because it makes the error rate go down. However, the same change can make systems more vulnerable, by converting small outages into sudden (and metastable) periods of internal retry storms. Your weekly loop where you look at your metrics and think about how to improve things may be making things worse."
Gateway (read: 9/5/2021)
Very common thing.
Reminds me of SAOs at Amazon (Service Access Object).
"I use a gateway whenever I access some external software and there is any awkwardness in that external element. Rather than let the awkwardness spread through my code, I contain to a single place in the gateway."
I love how Fowler writes.
"At that time I struggled whether to coin a new pattern name as opposed to referring to the existing Gang of Four patterns: Facade, Adapter, and Mediator. In the end I decided that there was enough of a difference that it was worth a new name."
"While Facade simplifies a more complex API, it's usually done by the writer of the service for general use. A gateway is written by the client for its particular use."
"Adapter is the closest GoF pattern to the gateway as it alters an class's interface to match another. But the adapter is defined in the context of both interfaces already being present, while with a gateway I'm defining the gateway's interface as I wrap the foreign element. That distinction led me to treat gateway as a separate pattern. Over time people have used "adapter" much more loosely, so it's not unusual to see gateways called adapters."
Eclipse - AOSA book (read: 9/2/2021)
Redundant against what? (read: 8/28/2021)
Okay.
Cost-Efficient Open Source Big Data Platform at Uber (read: 8/22/2021)
Better compression.
Delete unnecessary columns.
"Row order can dramatically affect the size of compressed Parquet files. This is due to both the Run-Length Encoding feature inside Parquet format, as well as the compression algorithm’s capability to take advantage of local repeats. We examined a list of the largest Hive tables at Uber, and performed manually-tuned ordering that reduces the table sizes by more than 50%. A common pattern that we found is simply to order the rows by user ID, and then timestamp for the log tables. Most log tables have user ID and timestamp columns. This allows us to compress many denormalized columns associated with the user ID extremely well."
Whoa.
The rest is rather hard to follow.
Challenges and Opportunities to Dramatically Reduce the Cost of Uber’s Big Data (read: 8/14/2021)
Ok.
Zero-Overhead Tree Processing with the Visitor Pattern (read: 8/7/2021)
"The Visitor Pattern gives you flexible, streaming, zero-overhead processing of complex data structures."
Wow this is a really awesome article.
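A minimal Python rendition of the idea (my own sketch, not the article's code):

```python
# The tree is walked once and each visitor streams over the nodes, so you can
# aggregate without materializing intermediate data structures.
class Node:
    def __init__(self, value, children=()):
        self.value, self.children = value, list(children)

    def accept(self, visitor):
        visitor.visit(self.value)
        for child in self.children:
            child.accept(visitor)

class SumVisitor:
    def __init__(self):
        self.total = 0
    def visit(self, value):
        self.total += value

tree = Node(1, [Node(2), Node(3, [Node(4)])])
v = SumVisitor()
tree.accept(v)
print(v.total)  # 10
```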
Hybrid Clock (read: 8/1/2021)
"Hybrid Logical Clock provides a way to have a version which is monotonically increasing just like a simple integer, but also has relation with the actual date time. Hybrid clocks are used in practice by databases like mongodb or cockroachdb."
Fancy.
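My own simplified sketch of the local-event path (the full algorithm also folds in timestamps from incoming messages):

```python
import time
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class HLCTimestamp:
    wall_time: int   # milliseconds since epoch
    counter: int     # tie-breaker when the wall clock hasn't moved

class HybridClock:
    """Monotonic like a plain counter, but still close to real time."""
    def __init__(self):
        self.latest = HLCTimestamp(0, 0)

    def now(self) -> HLCTimestamp:
        wall = int(time.time() * 1000)
        if wall > self.latest.wall_time:
            self.latest = HLCTimestamp(wall, 0)
        else:
            # Wall clock stalled or jumped back: keep the old wall time,
            # bump the counter so timestamps still strictly increase.
            self.latest = HLCTimestamp(self.latest.wall_time, self.latest.counter + 1)
        return self.latest
```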
namedtuple in a post-dataclasses world (read: 8/1/2021)
I knew neither about data classes nor about named tuples lol. Shame.
Well-written.
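A tiny side-by-side to remind myself of the differences (my own sketch, not the article's code):

```python
from collections import namedtuple
from dataclasses import dataclass

Point = namedtuple("Point", ["x", "y"])   # immutable, iterable, behaves like a tuple

@dataclass(frozen=True)
class PointDC:                            # immutable too, but not a tuple
    x: float
    y: float

p1, p2 = Point(1, 2), PointDC(1, 2)
x, y = p1                 # unpacking works for the namedtuple...
# x, y = p2               # ...but raises TypeError for the frozen dataclass
print(p1 == (1, 2))       # True: namedtuples compare equal to plain tuples
print(p2 == (1, 2))       # False: dataclasses only compare to their own type
```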
A Deep Dive into Airbnb’s Server-Driven UI System (read: 7/24/2021)
Problems of client-driven UI:
"there’s listing-specific logic built on each client to transform and render the listing data. This logic becomes complicated quickly and is inflexible if we make changes to how listings are displayed down the road." Don't quite get it.
"Second, each client has to maintain parity with each other. As mentioned, the logic for this screen gets complicated quickly and each client has their own intricacies and specific implementations for handling state, displaying UI, etc. It’s easy for clients to quickly diverge from one another." Fair.
"Finally, mobile has a versioning problem. Each time we need to add new features to our listing page, we need to release a new version of our mobile apps for users to get the latest experience. Until users update, we have few ways to determine if users are using or responding well to these new features." Mkay.
Interesting stuff.
Why not just use web apps then?
Write a time-series database engine from scratch (read: 7/17/2021)
Very insightful.
I should implement something like this some day.
The whole API is InsertRows and Select?
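Roughly what such a surface could look like (a toy of my own, nothing like the real engine):

```python
# Rows go into an in-memory buffer per metric, and Select scans a time range.
# A real engine adds partitioning, compression, and flushing to disk.
from bisect import bisect_left, bisect_right
from collections import defaultdict

class TinyTSDB:
    def __init__(self):
        self._points = defaultdict(list)   # metric -> sorted list of (ts, value)

    def insert_rows(self, rows):
        for metric, ts, value in rows:
            self._points[metric].append((ts, value))
            self._points[metric].sort()    # fine for a toy; real engines append in order

    def select(self, metric, start, end):
        points = self._points[metric]
        lo = bisect_left(points, (start,))
        hi = bisect_right(points, (end, float("inf")))
        return points[lo:hi]

db = TinyTSDB()
db.insert_rows([("cpu", 10, 0.5), ("cpu", 20, 0.7), ("cpu", 30, 0.9)])
print(db.select("cpu", 10, 20))   # [(10, 0.5), (20, 0.7)]
```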
Versioned Value (read: 7/11/2021)
Skip lists! It's the first time I see them being used anywhere.
MVCC = multiversion concurrency control.
Nice article.
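A minimal sketch of the idea (mine; the pattern uses a skip list, but a sorted list plus bisect gives the same behavior for a toy):

```python
# Each key maps to a list of (version, value) pairs; a read at version V
# returns the latest value written at or before V.
from bisect import bisect_right
from collections import defaultdict

class VersionedKV:
    def __init__(self):
        self._versions = defaultdict(list)   # key -> [(version, value), ...] kept sorted

    def put(self, key, version, value):
        self._versions[key].append((version, value))
        self._versions[key].sort()

    def get(self, key, version):
        versions = self._versions[key]
        i = bisect_right([v for v, _ in versions], version)
        return versions[i - 1][1] if i else None

kv = VersionedKV()
kv.put("title", 1, "draft")
kv.put("title", 3, "final")
print(kv.get("title", 2))   # "draft"  -- latest version <= 2
print(kv.get("title", 5))   # "final"
```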
Handling Flaky Unit Tests in Java (read: 7/11/2021)
The oldest topic there is :-)
Test Analyzer tool - I've seen this before :-)
"Therefore, to enable any developer to triage flaky failures, we built dynamic reproducer tools which can be used to reproduce the failure locally." - how?
Ok found it:
1. Run just the input test
2. Run all the tests in the input test class
3. Run all the tests in the test target
4. Run the test under port collision detection mode
5. Repeat steps 1–3 while increasing the resource load on the system
Supercharging Application Delivery (read: 7/7/2021)
"customers should be able to adopt, customize and evolve best practices and technologies for delivering their modern applications to the cloud, and not worry about how they roll this out – potentially to thousands of developers – across their organization."
Does not sound very amazonian. People there do care very much about how things are rolled out.
Next gen of infrastructure-as-code?
I still don't quite understand what exactly it does. Umbrella thingie for CloudFormation + CodeDeploy + CodePipelines?
It’s Officially Startup Season in Space (read: 7/7/2021)
Space is cool. I wonder if I can do something in this space.
The article is meh.
On the Diverse And Fantastical Shapes of Testing (read: 6/27/2021)
The definition of "unit" is vague. Can be "social" (can use other units) or "solitary" (everything is mocked out).
I for one like integration tests.
"The take-away here is when anyone starts talking about various testing categories, dig deeper on what they mean by their words, as they probably don't use them the same way as the last person you read did."
Why (and how) GitHub is adopting OpenTelemetry (read: 6/27/2021)
Common tracing is good, naturally.
I wish they explained the format a little.
Where/how are they stored?
It's probably time to stop recommending Clean Code (read: 6/19/2021)
Makes sense for the most part.
Indeed some Uncle Bob's code is questionable.
The maxim of tiny functions with no params always seemed bad to me.
But Clean Code still contains a ton of solid advice.
There must be some common reference. Everyone doing it their own way won't work.
Diving Deep on S3 Consistency (read: 6/13/2021)
He starts by describing analytical workflows, etc., but surely no one cares about strong consistency there.
If customers had to build their own solutions with Dynamo, etc. to track S3 consistency, it's really sad.
Hard to follow the actual design part.
I don't think it can be called "deep dive".
A new era of DevOps, powered by machine learning (read: 6/13/2021)
"Although DevOps technology has evolved dramatically over the last 5 years, it is still challenging. Issues related to concurrency, security or handling of sensitive information require expert evaluation and often slip through existing mechanisms like peer code reviews and unit testing."
What does this have to do with DevOps?
I wonder what Werner actually thinks of CodeGuru, not the sales pitch.
I'm not sure about DevOps Guru. Can you really "trust" it to monitor for you? I would be very scared to use it in place of some hand-picked thresholds, etc.
Easily build real-time apps with WebSockets and Azure Web PubSub—now in preview (read: 6/6/2021)
Seems pretty cool. If I was building a realtime app, I would seriously consider using it.
A brief history of Rust at Facebook (read: 6/6/2021)
https://news.ycombinator.com/item?id=26982879
A lot of discussion if there are/will be good jobs in Rust.
"When corruption or downtime can potentially bring services to a halt, reliability is a top priority. That’s why the team chose to go with Rust over C++."
It's always strange to hear that the language is expected to have such a strong influence on reliability. Testing, monitoring, all that - that seems to matter so much more.
Chilling Tales from Reddit Engineering (read: 5/29/2021)
Don't deploy stuff on Fridays and before holidays.
"'It works on my machine!' she declared, shipping the code to production." Then: "With haste, the young engineer rolled back the Firebase update."
MCM anyone?
"Only little did he know, a button that read ‘Save Policy,’ hid below the fold and his termination policy made no impact."
Who's updating prod infrastructure through the AWS UI? CloudFormation/Terraform FTW. Or at very least use the CLI.
Scaling Reporting at Reddit (read: 5/29/2021)
Old system: pre-aggregate things offline, one row per advertiser/day.
Need to code more complex queries.
"This system also had issues with memory usage. Queries that had to fetch many keys from Redis would cause our reporting service to deserialize large quantities of Thrift data. Deserializing all of this Thrift data was slow and required a lot of memory. We had to over-provision the reporting service to anticipate large queries. This system resulted in a degraded experience as our advertisers got larger."
"Another issue with the system was that it wasn’t flexible. Adding new breakdown capabilities usually meant adding new pre-aggregates before inserting the data into Redis. We would then have to implement corresponding querying logic in the reporting service. Adding new fields also required a lot of work to wire them through the entire pipeline. It was clear that our data store was not nearly as flexible as the products that we needed to support."
Sounds familiar!
Offloaded aggregation to Druid.
"A Spark job validates incoming events and places them into Amazon S3 as parquet files.
Another Spark job performs minor transformations on these parquet files to make them appropriate for Druid to ingest."
Why not just one job then? What's the point of these intermediate files in S3?
"The largest benefits of this migration can be seen in our availability and latency graphs. As you can see below, the new reporting service (blue) is consistently more available than the legacy system (green). Our legacy system struggled to maintain 99.5% availability at times while our new service is generally able to maintain 99.9% availability."
Huh, they use some different definition of availability then. Just success/all, rather than minutes of unavailability, etc.
In Slack, no one can hear your scream! (read: 5/23/2021)
Okay. Not sure why I read it.
Git hash function transition (read: 5/15/2021)
I wish I knew git internals better to understand what's going on.
The conciseness of the language, the clarity, and the level of detail are amazing.
Discussion of candidates: https://lore.kernel.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/
Kind of personal rant by Linus: https://lore.kernel.org/git/CA+55aFy_OJPMFbzMfN9yKwdGsx-8FZ0v_zt-d+xCN3KSCqdB9w@mail.gmail.com/
Facebook - How Facebook encodes your videos (read: 5/9/2021)
"From a pure computing perspective, applying the most advanced codecs to every video uploaded to Facebook would be prohibitively inefficient. Which means there needs to be a way to prioritize which videos need to be encoded using more advanced codecs."
ML to predict which videos are going to be highly watched.
"But this task isn’t as straightforward as allowing content from the most popular uploaders or those with the most friends or followers to jump to the front of the line. There are several factors that have to be taken into consideration so that we can provide the best video experience for people on Facebook while also ensuring that content creators still have their content encoded fairly on the platform."
Benefit = (relative compression efficiency of the encoding family at fixed quality) * (effective predicted watch time)
Cost = normalized compute cost of the missing encodings in the family
Priority = Benefit/Cost
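Plugging made-up numbers into these formulas (the values are mine, not Facebook's) just to see how the ranking behaves:

```python
def priority(compression_efficiency, predicted_watch_time, compute_cost):
    benefit = compression_efficiency * predicted_watch_time
    return benefit / compute_cost

# A clip predicted to go viral: big watch time justifies an expensive codec family.
print(priority(compression_efficiency=1.3, predicted_watch_time=500_000, compute_cost=40.0))
# A video nobody is expected to watch: advanced encodings stay low priority.
print(priority(compression_efficiency=1.3, predicted_watch_time=50, compute_cost=40.0))
```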
Not every device can play every codec.
"The best indicator of next-hour watch time is its previous watch time trajectory."
Duh.
"Improvements in ML metrics do not necessarily correlate directly to product improvements. Traditional regression loss functions, such as RMSE, MAPE, and Huber Loss, are great for optimizing offline models. But the reduction in modeling error does not always translate directly to product improvement, such as improved user experience, more watch time coverage, or better compute utilization."
Huh.
"we decided to build two models, one for handling upload-time requests and other for view-time requests"
Martin Fowler - Bitemporal History (read: 5/9/2021)
There's a property that can change, and we want to track its history. But sometimes we receive updates about the past.
So we keep a table with "record" dates and "actual" dates.
We read it as "on <record date>, we thought that on <actual date> the value was <value>".
"In programming terms, If I want to know Sally's salary, and I have no history, then I can get it with something like sally.salary. To add support for (actual) history I need to use sally.salaryAt('2021-02-25'). In a bitemporal world I need another parameter sally.salaryAt('2021-02-25', '2021-03-25')"
"One way to avoid it is to not support retroactive changes. If your insurance company says any changes become in force when they receive your letter - then that's a way of forcing actual time to match record time."
"One of the hardest parts of this is educating users on how bitemporal history works. Most people don't think of a historical record as something that changes, let alone of the two dimensions of record and actual history."
"Bitemporal history is a way of coming to terms that communication is neither perfect nor instantaneous. Actual history is no longer append-only, we go back and make retroactive changes. However record history itself is append only."
Our Journey Towards Cloud Efficiency
Too much corporate speak.
Utilize spot.
Eliminate waste.
S3 - use data retention policies.
S3 - use more cost effective tiers.
Be careful when storing small files in Glacier.
Compute - use K8s for auto scaling.
I wonder if stuff like AWS Config or Trusted Advisor recommend these?
Track usage.
Attribute usage.
Use reserved instances for RDS and ElastiCache.
"Culture of Cost Awareness".
"The changes stemmed from better contract management and utilization of our third-party cloud services." Wat?
How we sped up Dropbox Android app startup by 30%
Load time seems sort of constant over a 2-week interval. Need to look at bigger intervals.
Need to measure more granularly, for each step of the startup. Identify biggest offenders with this.
"The major app startup offenders included Firebase Performance library initialization, feature flag migration, and initial user loading."
"In our debugging, we discovered that Firebase suite initialization was seven times longer when Firebase Performance tool was enabled. To fix the performance issue, we chose to remove the Firebase Performance tool from the Android Dropbox application."
Hmm, can't they load it in background or something, without blocking users? And start using it when it's ready?
"In the legacy part of our application, we store Dropbox user contacts metadata on the device as JSON blobs. In an ideal world, those JSON blobs should be read and converted into Java objects only once. Unfortunately, the code to extract users was getting called multiple times from different legacy features of the app, and each time, the code would perform expensive JSON parsing to convert user JSON blobs into Java objects."
How slow can it possibly be? How big are these blobs?
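A sketch of the parse-once fix (mine, in Python rather than their Android code): memoize the expensive deserialization so repeated callers share one result.

```python
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def load_contacts(blob: str) -> tuple:
    # Expensive parse happens only the first time a given blob is seen.
    return tuple(json.loads(blob))

blob = '[{"name": "Alice"}, {"name": "Bob"}]'
first = load_contacts(blob)    # parses
second = load_contacts(blob)   # cache hit, no re-parse
print(first is second)         # True
```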
Detecting memory leaks in Android applications
They describe Android-specific memory leak patterns.
Android-specific memory leak finding lib: https://square.github.io/leakcanary/. Can upload found leaks.
Can hook LeakCanary to integ tests.
Packaging award-winning shows with award-winning technology (read: 4/25/2021)
So there's a codec-agnostic open packaging format for transferring videos. Ok.
4/19/2021
"Orchestrated Functions as a Microservice" lol.
It's interesting that these video encoding flows take days to complete. I wonder what SLO they need to provide for stuff like this. On one hand, availability should not be that big of a deal because it's async and not directly customer facing, but retries can be very costly. Can it, say, delay a release of a title?
But do they actually ever need to retry the entire thing?
It takes a confident person to call something internal Optimus.
So many details about internal systems, all with creative names. No way I will (or want to) follow what they are all doing.
Oh wait, some of these video processing flows are user-facing apparently. And latency matters.
A Day in the Life of an Experimentation and Causal Inference Scientist @ Netflix (read: 4/19/2021)
No takeaways.