Alfresco & The Curious Case of the PDF Killer

by / Thursday, 21 December 2017 / Published in Blog
Tinkering Fridays come back again!
Alfresco Supported Platforms

Some time ago, some devops guy observed 200% CPU consumption because of the Java process running Alfresco repository. The server crashed and after some minutes, the service was suddenly restored.

During one of these crisis, that devops got a Thread Dump to analyse.

$ kill -3 25041

$ tail -f -n 10000 /opt/alfresco/tomcat/logs/catalina.out

java.lang.Thread.State: RUNNABLE
at org.apache.pdfbox.pdmodel.common.PDStream.getFilters(PDStream.java:303)
at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject.commonXObjectCreation(PDXObject.java:152)
at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject.createXObject(PDXObject.java:122)
at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:257)
at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)

There were many different Java Threads running PDFBox to extract contents and images from PDF files, which was coherent with a scenario provoked by a massive migration involving several GiBs and performed some days before. SOLR was still parsing the whole thing. Just to reduce the problem, the devops used 2 more CPUs. However, for a big surprise, the Java process started to consume… 400% CPU!

As SOLR was tracking Alfresco every 15 seconds and every tracker request was involving that PDF manipulation, the server was collapsing. In order to maintain the server up while the tracking task end, the devops take a decision:

  • Change SOLR tracker to be executed every hour
  • Change SOLR tracker threads to 1
$ vi /opt/alfresco/solr4/workspace-SpacesStore/conf/solrcore.properties

alfresco.cron=0 0 * * * ? *
alfresco.corePoolSize=1
alfresco.workQueueSize=10

But, even when the server was performing better, 400% CPU problem was still living.

After that, the devops started to study problems in SOLR indexes…

https://localhost:8443/solr4/alfresco/afts?q=DOC_TYPE:UnindexedNode
https://localhost:8443/solr4/alfresco/afts?q=DOC_TYPE:ErrorNode

… and he found about 100,000 unindexed nodes.

As those nodes were unindexed, only a database query could be used to obtain details.

select *
from alf_node, alf_node_properties
where alf_node.id = alf_node_properties.node_id and
alf_node.id = 2685025;

And he realised that many PDF files were not indexed in the last weeks.

On the other hand, he found relevant to verify that SOLR was ignoring these unindexed nodes, just to focus analysing efforts only in new contents.

$ tail -f /opt/alfresco/tomcat/logs/localhost_access_log.txt

192.168.30.5 - - [13/Dec/2017:00:00:03 +0100] "GET /alfresco/service/api/solr/transactions?fromCommitTime=1513123178428&
toCommitTime=1513126778428&maxResults=2000&baseUrl=%2Fsolr4%2Falfresco%2F
&hostName=localhost&template=&instance=0&numberOfShards=1&port=8080
&stores=workspace%3A%2F%2FSpacesStore&isMaster=true&hasContent=true&shardMethod=MOD_ACL_ID
&lastUpdated=1513119602928&lastIndexedChangeSetCommitTime=1512991654833
&lastIndexedChangeSetId=1279059&lastIndexedTxCommitTime=1513115926431
&lastIndexedTxId=2181672 HTTP/1.1" 200 120

$ psql -U postgres
# \c alfresco
# select max(id) from alf_transaction;
max
---------
2181672
(1 row)

He decided to DEBUG by using default Log4J messages for NodeContentGet.java but there were no useful details been logged by that Java Class.

Keeping calm, he developed an overwriting for the Java Class including many information to be analysed.

<beans>
   <bean id="webscript.org.alfresco.repository.solr.nodeContent.get"
         class="es.keensoft.solr.client.NodeContentGet"
         parent="webscript">
      <property name="contentService" ref="contentService"/>
      <property name="nodeService" ref="nodeService"/>
      <property name="permissionService" ref="permissionService"/>
      <property name="nodeDAO" ref="nodeDAO"/>
      <property name="transformerDebug" ref="transformerDebug" />
      <property name="delegate" ref="webscript.content.streamer" />
   </bean>
</beans>

Those log lines showed that PDFBox was trying to extract information from many PDF files several times, so the server was frozen for minutes.

He download one sample PDF to reproduce the issue on a test server. And he even was able to reproduce the behaviour with a simple standalone program.

Again, to reduce the problem while it was solved, he created a rule in the input folder to avoid new PDF content being indexed.

var props = new Array(); 
props["cm:isContentIndexed"] = false; 
document.addAspect("cm:indexControl", props);

He started examining PDF internal properties, but they didn’t seem to be special.

Producer TCPDF 6.2.5 (http://www.tcpdf.org/)
CreationDate Thu Oct 26 18:54:00 2017
ModDate Thu Oct 26 18:54:00 2017
Tagged no
UserProperties no
Suspects no
Form none
JavaScript no
Pages 7
Encrypted no
Page size 595 x 842 pts (A4)
Page rot 0
File size 2025590 bytes
Optimized no
PDF version 1.5

Every clue was indicating that the error was coming from PDFBox or Tika. Both Apache products are patched by Alfresco and the version used by Alfresco is quite old. He tried to adapt current versions of PDFBox and Tika without success. So he focused on debugging the problem in the standalone program. An infinite recursion was reached at some point, something similar to TIKA-1742

He compared the source code and adapted the lines from PDF2XHTML.java to solve the problem. After packaging the patch, he deployed it in Alfresco and… the problem went away!

He thought that this should be a common problem, so he joined Alfresco IRC to talk with the *gurus*. After that, he realised that the problem was unknown for Alfresco Community.

So he decided to make it public, just to prevent other devops in the world from suffering the same Calvary as him.

And they were happily ever after!

Using Open Source products is always worthy. Just ask to that devops!

Responsable tecnológico para soluciones de gestión documental en keensoft. Especialista en Alfresco y en implantaciones de Administración Electrónica.

Tagged under:

Leave a Reply

TOP