Alfresco & The Curious Case of the PDF Killer
Some time ago, some devops guy observed 200% CPU consumption because of the Java process running Alfresco repository. The server crashed and after some minutes, the service was suddenly restored.
During one of these crisis, that devops got a Thread Dump to analyse.
$ kill -3 25041 $ tail -f -n 10000 /opt/alfresco/tomcat/logs/catalina.out java.lang.Thread.State: RUNNABLE at org.apache.pdfbox.pdmodel.common.PDStream.getFilters(PDStream.java:303) at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject.commonXObjectCreation(PDXObject.java:152) at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject.createXObject(PDXObject.java:122) at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:257) at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)
There were many different Java Threads running PDFBox to extract contents and images from PDF files, which was coherent with a scenario provoked by a massive migration involving several GiBs and performed some days before. SOLR was still parsing the whole thing. Just to reduce the problem, the devops used 2 more CPUs. However, for a big surprise, the Java process started to consume… 400% CPU!
As SOLR was tracking Alfresco every 15 seconds and every tracker request was involving that PDF manipulation, the server was collapsing. In order to maintain the server up while the tracking task end, the devops take a decision:
- Change SOLR tracker to be executed every hour
- Change SOLR tracker threads to 1
$ vi /opt/alfresco/solr4/workspace-SpacesStore/conf/solrcore.properties alfresco.cron=0 0 * * * ? * alfresco.corePoolSize=1 alfresco.workQueueSize=10
But, even when the server was performing better, 400% CPU problem was still living.
After that, the devops started to study problems in SOLR indexes…
… and he found about 100,000 unindexed nodes.
As those nodes were unindexed, only a database query could be used to obtain details.
select * from alf_node, alf_node_properties where alf_node.id = alf_node_properties.node_id and alf_node.id = 2685025;
And he realised that many PDF files were not indexed in the last weeks.
On the other hand, he found relevant to verify that SOLR was ignoring these unindexed nodes, just to focus analysing efforts only in new contents.
$ tail -f /opt/alfresco/tomcat/logs/localhost_access_log.txt 192.168.30.5 - - [13/Dec/2017:00:00:03 +0100] "GET /alfresco/service/api/solr/transactions?fromCommitTime=1513123178428& toCommitTime=1513126778428&maxResults=2000&baseUrl=%2Fsolr4%2Falfresco%2F &hostName=localhost&template=&instance=0&numberOfShards=1&port=8080 &stores=workspace%3A%2F%2FSpacesStore&isMaster=true&hasContent=true&shardMethod=MOD_ACL_ID &lastUpdated=1513119602928&lastIndexedChangeSetCommitTime=1512991654833 &lastIndexedChangeSetId=1279059&lastIndexedTxCommitTime=1513115926431 &lastIndexedTxId=2181672 HTTP/1.1" 200 120 $ psql -U postgres # \c alfresco # select max(id) from alf_transaction; max --------- 2181672 (1 row)
He decided to DEBUG by using default Log4J messages for NodeContentGet.java but there were no useful details been logged by that Java Class.
Keeping calm, he developed an overwriting for the Java Class including many information to be analysed.
<beans> <bean id="webscript.org.alfresco.repository.solr.nodeContent.get" class="es.keensoft.solr.client.NodeContentGet" parent="webscript"> <property name="contentService" ref="contentService"/> <property name="nodeService" ref="nodeService"/> <property name="permissionService" ref="permissionService"/> <property name="nodeDAO" ref="nodeDAO"/> <property name="transformerDebug" ref="transformerDebug" /> <property name="delegate" ref="webscript.content.streamer" /> </bean> </beans>
Those log lines showed that PDFBox was trying to extract information from many PDF files several times, so the server was frozen for minutes.
He download one sample PDF to reproduce the issue on a test server. And he even was able to reproduce the behaviour with a simple standalone program.
Again, to reduce the problem while it was solved, he created a rule in the input folder to avoid new PDF content being indexed.
var props = new Array(); props["cm:isContentIndexed"] = false; document.addAspect("cm:indexControl", props);
He started examining PDF internal properties, but they didn’t seem to be special.
Every clue was indicating that the error was coming from PDFBox or Tika. Both Apache products are patched by Alfresco and the version used by Alfresco is quite old. He tried to adapt current versions of PDFBox and Tika without success. So he focused on debugging the problem in the standalone program. An infinite recursion was reached at some point, something similar to TIKA-1742
He compared the source code and adapted the lines from PDF2XHTML.java to solve the problem. After packaging the patch, he deployed it in Alfresco and… the problem went away!
He thought that this should be a common problem, so he joined Alfresco IRC to talk with the *gurus*. After that, he realised that the problem was unknown for Alfresco Community.
So he decided to make it public, just to prevent other devops in the world from suffering the same Calvary as him.
And they were happily ever after!
Using Open Source products is always worthy. Just ask to that devops!