Pharmaceutical firm uses Hadoop to crunch huge amounts of data so it can develop vaccines faster. One of eight profiles of InformationWeek Elite 100 Business Innovation Award winners.
Vaccines often contain attenuated viruses, meaning they’re altered so they give you immunity but not the actual disease, and thus they have to be handled under precise conditions during every step in the manufacturing process. Components might have to be stored at exactly -8 degrees for a year or more, and with even a slight variance from regulator-approved manufacturing processes, the materials have to be discarded.
“It might take three parts to get one part, and what we drop or discard amounts to hundreds of millions of dollars in lost revenue,” says George Llado, VP of information technology at Merck & Co.
In the summer of 2012, Llado was seeing higher-than-usual discard rates on certain vaccines. His team was looking into the causes of the low yield rates, but the usual investigative approach involved time-consuming spreadsheet-based analyses of data collected throughout the manufacturing process. Sources include process-historian systems on the shop floor that tag and track each batch. Maintenance systems detail plant equipment service dates and calibration settings. Building-management systems capture air pressure, temperature, and other readings in multiple locations at each plant, sampling by the minute.
Aligning all this data from disparate systems and spotting abnormalities took months using the spreadsheet-based approach, and storage and memory limits meant researchers could only look at a batch or two at a time. Jerry Megaro, Merck’s director of manufacturing advanced analytics and innovation, was determined to find a better way.
By early 2013, a Merck team was experimenting with a massively scalable distributed relational database. But when Llado and Megaro learned that Merck Research Laboratories (MRL) could provide their team with cloud-based Hadoop compute, they decided to change course.
Built on a Hortonworks Hadoop distribution running on Amazon Web Services, MRL’s Merck Data Science Platform turned out to be a better fit for the analysis because Hadoop supports a schema-on-read approach. As a result, data from 16 disparate sources could be used in analysis without having to be transformed with time-consuming and expensive ETL processes to conform to a rigid, predefined relational database schema.
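To make the schema-on-read idea concrete, here is a minimal Python sketch. It is not Merck's code: the two raw payloads, the field names, and the helper functions are all hypothetical, and plain Python stands in for the Hadoop tooling. The point it illustrates is that the raw files are stored untouched, and each reader imposes structure only at the moment the data is queried.

```python
import csv
import io
import json

# Raw payloads kept exactly as two hypothetical source systems emitted them.
# Nothing is transformed at load time; there is no upfront ETL step.
RAW_HISTORIAN_CSV = (
    "batch_id,timestamp,temperature\n"
    "B001,2013-07-01T00:00:00,4.1\n"
    "B002,2013-07-01T00:01:00,4.6\n"
)
RAW_MAINTENANCE_JSON = (
    '[{"batch": "B001", "equipment": "FERM-3", "calibrated": "2013-06-15"}]'
)


def read_historian(raw):
    """Apply a schema to the historian CSV only when it is read."""
    rows = csv.DictReader(io.StringIO(raw))
    return [
        {
            "batch_id": r["batch_id"],
            "timestamp": r["timestamp"],
            "temperature": float(r["temperature"]),
        }
        for r in rows
    ]


def read_maintenance(raw):
    """Apply a schema to the maintenance JSON, renaming fields on the fly."""
    return [
        {
            "batch_id": r["batch"],
            "equipment_id": r["equipment"],
            "calibrated": r["calibrated"],
        }
        for r in json.loads(raw)
    ]


def join_on_batch(readings, maintenance):
    """Align the two sources on the shared batch ID."""
    by_batch = {m["batch_id"]: m for m in maintenance}
    return [
        {**r, **by_batch[r["batch_id"]]}
        for r in readings
        if r["batch_id"] in by_batch
    ]


joined = join_on_batch(
    read_historian(RAW_HISTORIAN_CSV),
    read_maintenance(RAW_MAINTENANCE_JSON),
)
```

If a later question needs a different view of the same files, only the reader functions change; the stored raw data does not have to be reloaded or reshaped.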
“We took all of our data on one vaccine, whether from the labs or the process historians or the environmental systems, and just dropped it into a data lake,” says Llado.
Megaro’s team was then able to come up with conclusive answers about production yield variance within just three months. In the first month, July 2013, the team loaded the data onto a partition of the cloud-based platform, and it used MapReduce, Hive, and advanced dynamic time-warping techniques to aggregate and align the data sets around common metadata dimensions such as batch IDs, plant equipment IDs, and time stamps.
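Dynamic time warping, mentioned above, compares two time series that follow a similar shape but run at different speeds, such as sensor traces from two batches whose process steps took slightly different amounts of time. The sketch below is the textbook dynamic-programming formulation in plain Python, not Merck's implementation; the function name and the sample readings are illustrative assumptions.

```python
def dtw_distance(a, b):
    """Return the dynamic time warping distance between two numeric sequences.

    Textbook O(len(a) * len(b)) formulation using absolute difference
    as the local cost.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a point
                cost[i][j - 1],      # b advances, a repeats a point
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Two hypothetical sensor traces with the same shape; the second one
# lingers at the middle reading for an extra sample.
batch_a = [1.0, 2.0, 3.0]
batch_b = [1.0, 2.0, 2.0, 3.0]
distance = dtw_distance(batch_a, batch_b)
```

Because the second trace only repeats a reading, the warped distance is zero, whereas a point-by-point comparison would not even be defined for series of different lengths. That tolerance for stretched or compressed timelines is what makes the technique useful for lining up batches.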
In the second month, analysts used R-based analytics to chart and cluster every batch of the vaccine ever made on a heat map. Spotting notable patterns, the team then used R to produce investigative histograms and scatter plots, and it drilled down with Hive to explore hypotheses about the factors tied to low-yield production runs. Using an Agile development approach, the team set up daily data-exploration goals, but it could change course by that afternoon if it failed to find solid data backing up a particular hypothesis. In the third month, the team developed models, testing against the trove of historical data to prove and disprove leading theories about yield factors.
Through 15 billion calculations and more than 5.5 million batch-to-batch comparisons, Merck discovered that certain characteristics in the fermentation phase of vaccine production were closely tied to yield in a final purification step. “That was pretty powerful, and we came up with a model that demonstrated, quantifiably, that specific fermentation performance traits are very important to yield,” says Megaro.
The good news is that these fermentation traits can be controlled, but Merck has to prove that in a test lab before it can introduce any changes to its production environment. And if any process changes are deemed material, Merck will have to refile the vaccine’s manufacturing process with regulatory agencies.
With the case all but solved for one vaccine, Merck is applying the lessons learned to a variant of that product that is expected to be approved for sale as soon as this year. And drawing on both the manufacturing insights and the new big data analysis approach, Merck intends to optimize the production of other vaccines now in development. They’re all potentially lifesaving products, according to Merck, and it’s clear that the new data analysis approach marks a huge advance in ensuring efficient manufacturing and a more plentiful supply.