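【1.2】Parsing PDFs with pdf2htmlEX
Marketing recently asked us to extract product information from other companies for comparison. Web pages are the easy case: crawl them with scrapy, then parse the HTML with BeautifulSoup. But what about PDFs? Convert them to HTML first, of course.
This takes three steps:
- Decrypt the PDF
- Convert the PDF to HTML
- Parse the HTML and extract the information you want
I. Decrypting the PDF online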
https://smallpdf.com/cn/unlock-pdf
II. Converting the PDF to HTML
1. About pdf2htmlEX
III. Installation
Official documentation: https://github.com/coolwanglu/pdf2htmlEX/wiki/Building
3.1 Installing on Mac
On Mac OS X you can install it with brew:
brew install pdf2htmlEX
(This tool depends on a huge number of packages; installing them all by hand would drive you mad.)
Install error 1:
==> Downloading ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.18.tar.xz
curl: (78) RETR response: 550
Error: Failed to download resource "libpng"
Download failed: ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.18.tar.xz
Fix:
Download the tarball manually (from mainland China this needs a VPN):
ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.30.tar.xz
cd /Users/tanqianshan/Documents/project/8.pdf_convert/lib
wget -c ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.30.tar.xz
tar xvJf libpng-1.6.30.tar.xz
cd libpng-1.6.30
./configure --prefix=/usr
make check
make install
Install error 2:
Reinstalling still fails:
tanqianshan[2.其他公司表型整理]$ brew install pdf2htmlEX
Warning: You are using OS X 10.12.
We do not provide support for this pre-release version.
You may encounter build failures or other breakage.
Error: You must `brew link cmake' before pdf2htmlex can be installed
Fix:
brew uninstall cmake
sudo brew install cmake
brew install pdf2htmlEX
Installing pdf2htmlEX manually
Software that must be installed first:
1. poppler
Method 1 (failed):
pip install poppler
Could not find a version that satisfies the requirement poppler (from versions: )
Method 2:
cd /Users/tanqianshan/Documents/project/8.pdf_convert/lib/
wget https://poppler.freedesktop.org/poppler-0.56.0.tar.xz
tar xvJf poppler-0.56.0.tar.xz
cd poppler-0.56.0
./configure --prefix=/usr
Install pdf2htmlEX
cd /Users/tanqianshan/Documents/project/8.pdf_convert/lib
git clone https://github.com/coolwanglu/pdf2htmlEX.git
cd pdf2htmlEX
cmake . && make && sudo make install
The fix that finally worked:
Skip libpng for now:
brew install pdf2htmlEX --without-libpng
3. Running pdf2htmlEX
pdf2htmlEX --zoom 1.3 boao_aishenpu.pdf
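When several PDFs need converting, the single command above can be wrapped in a loop. This is only a minimal sketch, assuming all the PDFs sit in one directory. `convert_all` and the `PDF2HTMLEX_BIN` variable are hypothetical names; `--zoom` and `--dest-dir` are the pdf2htmlEX options already used in this post.

```shell
#!/bin/sh
# Sketch: convert every PDF in a directory with the same zoom factor.
# PDF2HTMLEX_BIN is a hypothetical override so a different binary can be used.
PDF2HTMLEX_BIN="${PDF2HTMLEX_BIN:-pdf2htmlEX}"

convert_all() {
    src_dir="$1"    # directory holding the .pdf files
    out_dir="$2"    # directory that receives the .html files
    mkdir -p "$out_dir"
    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # no PDFs found: the glob stays literal
        "$PDF2HTMLEX_BIN" --zoom 1.3 --dest-dir "$out_dir" "$pdf" \
            || echo "conversion failed: $pdf" >&2
    done
}
```

Usage: convert_all ./pdfs ./html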
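3.2 Installing on CentOS 7 (an install that nearly had me in tears)
3.2.1 Installing the dependency libraries
Because of all the errors, I looked up quite a few install recipes and could no longer tell which libraries were actually needed, so I simply installed all of them.
yum-config-manager --enable epel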
yum -y update
Install the GPG key
cd /etc/pki/rpm-gpg/
wget http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
cd /etc/yum.repos.d
wget -c http://linuxsoft.cern.ch/cern/scl/slc6-scl.repo
Upgrade pip
pip install --upgrade pip
pip install --upgrade setuptools
pip install lxml
yum -y install libtool-ltdl-devel.x86_64 zlib-devel.x86_64 glib2-devel.x86_64 freetype-devel.x86_64 poppler-glib-devel.x86_64 git cmake mk-configure.noarch libjpeg-turbo.x86_64 libtiff.x86_64 libpng-devel.x86_64 giflib-devel.x86_64 libXt-devel.x86_64 autoconf automake libtool bzip2 libxml2.x86_64 libuninameslist-devel.x86_64 libspiro.x86_64 dbus-python-devel.x86_64 pango-devel.x86_64 chrpath uuid-c++.x86_64 uuid.x86_64 uthash-devel.noarch cmake gcc java-1.8.0-openjdk libpng-devel.x86_64 fontforge-devel.x86_64 cairo-devel.x86_64 poppler-devel.x86_64 libspiro-devel.x86_64 freetype-devel.x86_64 poppler-data libjpeg-turbo-devel git gcc-c++ libjpeg-turbo-devel.x86_64 poppler-data.noarch jpackage-utils.noarch gettext.x86_64 jpackage-utils.noarch python27-python-devel.x86_64 libxml2-python27.x86_64 libxml2-python26.x86_64 python27-python-devel.x86_64 libxslt-devel.x86_64 libxslt-python26.x86_64 libxslt.x86_64 libxml2-devel libxslt-devel python-devel python-javapackages.noarch
yum --nogpgcheck install poppler-cpp.x86_64 poppler-cpp-devel.x86_64 libstdc++48-static.x86_64 openjpeg-devel.x86_64
yum install cmake gcc gcc-c++ gtk+-devel gimp-devel gimp-devel-tools gimp-help-browser zlib-devel libtiff-devel libjpeg-devel libpng-devel gstreamer-devel libavc1394-devel libraw1394-devel libdc1394-devel jasper-devel jasper-utils swig python libtool nasm
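yum install autotools-dev libjpeg-dev libtiff4-dev libpng12-dev libgif-dev libxt-dev autoconf automake libtool bzip2 libxml2-dev libuninameslist-dev libspiro-dev python-dev libpango1.0-dev libcairo2-dev chrpath uuid-dev uthash-dev
(Note: the -dev names on this line are Debian/Ubuntu package names, so yum reports them as unavailable and only installs the rest.)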
yum install cmake gcc gnu-getopt java-1.8.0-openjdk libpng-devel fontforge-devel cairo-devel poppler-devel libspiro-devel freetype-devel poppler-data libjpeg-turbo-devel git make gcc-c++
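The install lines above repeat many package names (cmake, gcc, git and libpng-devel all show up more than once). yum tolerates duplicates, but a de-duplicated list is easier to audit when you later try to work out what was really needed. A minimal sketch; `dedupe_pkgs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: collapse a whitespace-separated package list to sorted unique names.
dedupe_pkgs() {
    # One name per line, drop blank lines, sort uniquely, join with spaces.
    printf '%s\n' "$@" | tr -s ' ' '\n' | grep -v '^$' | LC_ALL=C sort -u | paste -sd ' ' -
}
```

Usage: yum -y install $(dedupe_pkgs "cmake gcc git cmake gcc-c++ git")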
3.2.2 Installing other libraries
Install openjpeg
wget -c https://sourceforge.net/projects/openjpeg.mirror/files/2.1.0/openjpeg-2.1.0.tar.gz/download?use_mirror=nchc
mv download?use_mirror=nchc openjpeg-2.1.0.tar.gz
tar -xzf openjpeg-2.1.0.tar.gz;
cd openjpeg-2.1.0
cmake . && make && make install
Install poppler
cd /home/sam/anbank-web/lib
wget -c http://poppler.freedesktop.org/poppler-0.35.0.tar.xz
tar -xf poppler-0.35.0.tar.xz
cd poppler-0.35.0/
./configure --prefix=/usr --enable-xpdf-headers --enable-libjpeg
make && make install
export LD_LIBRARY_PATH=/usr/lib:/usr/lib64
export LD_RUN_PATH=/usr/lib:/usr/lib64
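LD_LIBRARY_PATH and LD_RUN_PATH are colon-separated lists, so each new directory should be appended to the existing value rather than assigned over it. A small sketch of a helper that appends a directory only when it is not already listed; `append_dir` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: append a directory to a colon-separated list unless already present.
append_dir() {
    list="$1"   # current value, e.g. "$LD_LIBRARY_PATH" (may be empty)
    dir="$2"    # directory to add
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;        # already listed
        "::")       printf '%s\n' "$dir" ;;         # list was empty
        *)          printf '%s\n' "$list:$dir" ;;   # append at the end
    esac
}
```

Usage: LD_LIBRARY_PATH=$(append_dir "$LD_LIBRARY_PATH" /usr/lib64); export LD_LIBRARY_PATH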
Check the gcc version
gcc -v
Install fontforge
The correct procedure:
cd /home/sam/anbank-web/lib
git clone https://github.com/coolwanglu/fontforge.git fontforge.git
cd fontforge.git
git checkout pdf2htmlEX
./autogen.sh
./configure --enable-debug --prefix=/usr
make V=1 # this step printed errors
make install
fontforge -version
cp fontforge.pc /usr/local/lib/pkgconfig/
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
vim CMakeLists.txt
#adjust version
export LD_LIBRARY_PATH=/usr/local/lib
export LIBRARY_PATH=/usr/local/lib
Problems hit during installation and how they were fixed:
cd /home/sam/anbank-web/lib
git clone https://github.com/coolwanglu/fontforge.git fontforge.git
cd fontforge.git
./autogen.sh
Error 1:
at least version 1.6.0 of GNU Autoconf must be installed
Fix:
yum install autoconf
Error 2:
at least version 1.6.0 of GNU Automake must be installed
Fix:
yum install automake
Error 3:
at least version 1.4.2 of GNU Libtool must be installed
Fix:
yum install libtool
Error 4:
libtoolize: `COPYING.LIB' not found in `/usr/share/libtool/libltdl'
Fix:
yum install libtool-ltdl-devel
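Errors 1 to 3 all come from build tools that are not installed, so checking for them before running ./autogen.sh can save a few round trips. A minimal sketch; `missing_tools` is a hypothetical helper, and it does not cover error 4, which is a missing file rather than a missing command.

```shell
#!/bin/sh
# Sketch: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Usage: missing_tools autoconf automake libtoolize
It prints nothing when all three are installed.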
With those errors fixed, continue:
./autogen.sh --verbose
./configure --prefix=/usr
Error 5:
configure: error: Package requirements (pango >= 1.10 pangoxft) were not met:
No package 'pango' found
No package 'pangoxft' found
Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.
Alternatively, you may set the environment variables PANGO_CFLAGS
and PANGO_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.
Fix:
yum install pango pango-devel
make;
Error 6:
ufo.c:925:12: error: conflicting types for 'SplinePointListInterpretGlif'
Fix:
Install the packages with the yum commands listed in 3.2.1.
Output of make install:
Libraries have been installed in:
/usr/local/lib64/python2.7/site-packages
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
export LD_LIBRARY_PATH=/usr/local/lib64/python2.7/site-packages${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
3.2.3 Installing pdf2htmlEX
cd /home/sam/anbank-web/lib
git clone https://github.com/coolwanglu/pdf2htmlEX.git
cd pdf2htmlEX
cmake . && make && make install
# use cmake -DCMAKE_BUILD_TYPE=Debug . for a debug build
pkg-config --print-provides --cflags --libs poppler
Error:
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyTuple_GetItem'
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyImport_AppendInittab'
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyUnicodeUCS4_AsUTF8String'
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyString_Decode'
collect2: error: ld returned 1 exit status
make[2]: *** [pdf2htmlEX] Error 1
make[1]: *** [CMakeFiles/pdf2htmlEX.dir/all] Error 2
make: *** [all] Error 2
Cause:
/usr/lib64/libfontforge.so.1 conflicts with libfontforge.so.2
Fix:
[root@localhost lib64]# ll |grep libfont
lrwxrwxrwx. 1 root root 22 Sep 25 17:53 libfontconfig.so -> libfontconfig.so.1.7.0
lrwxrwxrwx. 1 root root 22 Apr 9 03:38 libfontconfig.so.1 -> libfontconfig.so.1.7.0
-rwxr-xr-x. 1 root root 255968 Aug 2 2017 libfontconfig.so.1.7.0
lrwxrwxrwx. 1 root root 21 Apr 9 03:42 libfontembed.so.1 -> libfontembed.so.1.0.0
-rwxr-xr-x. 1 root root 53224 Aug 3 2017 libfontembed.so.1.0.0
lrwxrwxrwx. 1 root root 19 Apr 9 03:38 libfontenc.so.1 -> libfontenc.so.1.0.0
-rwxr-xr-x. 1 root root 27512 Aug 2 2017 libfontenc.so.1.0.0
lrwxrwxrwx. 1 root root 21 Sep 25 17:53 libfontforge.so -> libfontforge.so.1.0.0
lrwxrwxrwx. 1 root root 21 Sep 25 17:53 libfontforge.so.1 -> libfontforge.so.1.0.0
-rwxr-xr-x. 1 root root 4214600 Jun 10 2014 libfontforge.so.1.0.0
lrwxrwxrwx. 1 root root 32 Sep 29 09:24 libfontforge.so.2 -> /usr/local/lib/libfontforge.so.2
ln -s /usr/local/lib/libfontforge.so.2 /usr/lib64/libfontforge.so.2
ln -s /usr/local/lib/libfontforge.so.2 /usr/lib64/libfontforge.so
ln -s /usr/local/lib/libpoppler.so.54 /usr/lib64/libpoppler.so.54
The next error:
CMakeFiles/pdf2htmlEX.dir/3rdparty/poppler/git/CairoFontEngine.cc.o: In function `CairoFreeTypeFont::create(GfxFont*, XRef*, FT_LibraryRec_*, bool)':
/home/sam/anbank-web/lib/pdf2htmlEX/3rdparty/poppler/git/CairoFontEngine.cc:425: undefined reference to `GfxFont::locateFont(XRef*, PSOutputDev*)'
CMakeFiles/pdf2htmlEX.dir/src/HTMLRenderer/font.cc.o: In function `pdf2htmlEX::HTMLRenderer::install_font(GfxFont*)':
/home/sam/anbank-web/lib/pdf2htmlEX/src/HTMLRenderer/font.cc:889: undefined reference to `GfxFont::locateFont(XRef*, PSOutputDev*)'
CMakeFiles/pdf2htmlEX.dir/src/HTMLRenderer/font.cc.o: In function `pdf2htmlEX::HTMLRenderer::install_external_font(GfxFont*, pdf2htmlEX::FontInfo&)':
/home/sam/anbank-web/lib/pdf2htmlEX/src/HTMLRenderer/font.cc:944: undefined reference to `GfxFont::locateFont(XRef*, PSOutputDev*)'
CMakeFiles/pdf2htmlEX.dir/src/BackgroundRenderer/SplashBackgroundRenderer.cc.o: In function `pdf2htmlEX::SplashBackgroundRenderer::SplashBackgroundRenderer(std::string const&, pdf2htmlEX::HTMLRenderer*, pdf2htmlEX::Param const&)':
/home/sam/anbank-web/lib/pdf2htmlEX/src/BackgroundRenderer/SplashBackgroundRenderer.cc:35: undefined reference to `SplashOutputDev::SplashOutputDev(SplashColorMode, int, bool, unsigned char*, bool, SplashThinLineMode, bool)'
collect2: error: ld returned 1 exit status
make[2]: *** [pdf2htmlEX] Error 1
make[1]: *** [CMakeFiles/pdf2htmlEX.dir/all] Error 2
make: *** [all] Error 2
ln -s /usr/lib/libfontforge.so.2 /usr/lib64/libfontforge.so.2
ln -s /usr/lib/libfontforge.so.2 /usr/lib64/libfontforge.so
ln -s /usr/lib/libpoppler.so.54 /usr/lib64/libpoppler.so.54
Cause:
The libpoppler.so in /usr/lib64 pointed at the wrong version. It was fixed as follows:
cd /usr/lib64
[root@localhost lib64]# ll |grep popp
lrwxrwxrwx. 1 root root 24 Sep 28 23:37 libpoppler-cpp.so -> libpoppler-cpp.so.10.2.0
lrwxrwxrwx. 1 root root 24 Sep 28 23:37 libpoppler-cpp.so.10 -> libpoppler-cpp.so.10.2.0
-rwxr-xr-x. 1 root root 82680 Aug 31 2017 libpoppler-cpp.so.10.2.0
lrwxrwxrwx. 1 root root 25 Sep 28 23:36 libpoppler-glib.so -> libpoppler-glib.so.18.6.0
lrwxrwxrwx. 1 root root 25 Sep 25 17:53 libpoppler-glib.so.18 -> libpoppler-glib.so.18.6.0
-rwxr-xr-x. 1 root root 370648 Aug 31 2017 libpoppler-glib.so.18.6.0
lrwxrwxrwx. 1 root root 20 Sep 25 17:53 libpoppler.so -> libpoppler.so.46.0.0
lrwxrwxrwx. 1 root root 20 Sep 25 17:53 libpoppler.so.46 -> libpoppler.so.46.0.0
-rwxr-xr-x. 1 root root 2689272 Aug 31 2017 libpoppler.so.46.0.0
lrwxrwxrwx. 1 root root 25 Oct 6 15:07 libpoppler.so.54 -> /usr/lib/libpoppler.so.54
rm libpoppler.so
ln -s libpoppler.so.54 libpoppler.so
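Before relinking like this, it helps to see which real file each library name currently resolves to. A minimal sketch, assuming GNU readlink (as shipped on CentOS); `show_links` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: print "name -> resolved target" for every file matching a prefix.
show_links() {
    dir="$1"      # e.g. /usr/lib64
    prefix="$2"   # e.g. libpoppler.so
    for f in "$dir"/"$prefix"*; do
        [ -e "$f" ] || continue    # skip broken links and unmatched globs
        printf '%s -> %s\n' "${f##*/}" "$(readlink -f "$f")"
    done
}
```

Usage: show_links /usr/lib64 libpoppler.so
To see which copy the binary actually loads, ldd "$(command -v pdf2htmlEX)" | grep poppler gives the runtime view.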
Error:
Segmentation fault (core dumped)
Cause:
fontforge or one of the other dependencies is too old
Fix:
Reinstall pdf2htmlEX
tail -f /var/log/messages
Oct 5 23:33:09 localhost abrt-server: Executable '/usr/local/bin/pdf2htmlEX' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Oct 5 23:33:09 localhost abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2018-10-05-23:33:09-385272' exited with 1
Oct 5 23:33:09 localhost abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2018-10-05-23:33:09-385272'
Oct 5 23:33:14 localhost kernel: pdf2htmlEX[385279]: segfault at 6b00000064 ip 000000000044747d sp 00007ffd8197fb40 error 4 in pdf2htmlEX[400000+67000]
Oct 5 23:33:14 localhost abrt-hook-ccpp: Process 385279 (pdf2htmlEX) of user 1001 killed by SIGSEGV - ignoring (repeated crash)
Error 1:
Fix for "Executable '/usr/local/bin/pdf2htmlEX' doesn't belong to any package and ProcessUnpackaged is set to 'no'":
vim /etc/abrt/abrt-action-save-package-data.conf
Change
ProcessUnpackaged = no
to:
ProcessUnpackaged = yes
Then restart the service:
service abrtd restart
Error 2: pdf2htmlEX still crashes:
[root@localhost ~]# tail -f /var/spool/mail/root
Process 385655 (pdf2htmlEX) of user 1001 killed by SIGSEGV - ignoring (repeated crash)
:Oct 06 00:03:13 localhost.localdomain kernel: pdf2htmlEX[385861]: segfault at 6b00000064 ip 000000000044747d sp 00007fff21daa6a0 error 4 in pdf2htmlEX[400000+67000]
:Oct 06 00:03:13 localhost.localdomain abrt-hook-ccpp[385862]: Process 385861 (pdf2htmlEX) of user 1001 killed by SIGSEGV - dumping core
:[User Logs]:
The errors shown by tail -f /var/log/messages changed to:
Oct 6 12:36:29 localhost kernel: pdf2htmlEX[395835]: segfault at 6b00000064 ip 000000000044747d sp 00007ffc789b0f80 error 4 in pdf2htmlEX[400000+67000]
Oct 6 12:36:29 localhost abrt-hook-ccpp: Process 395835 (pdf2htmlEX) of user 1001 killed by SIGSEGV - dumping core
Oct 6 12:36:29 localhost abrt-server: Duplicate: core backtrace
Oct 6 12:36:29 localhost abrt-server: DUP_OF_DIR: /var/spool/abrt/ccpp-2018-10-06-00:03:13-385861
Oct 6 12:36:29 localhost abrt-server: Deleting problem directory ccpp-2018-10-06-12:36:29-395835 (dup of ccpp-2018-10-06-00:03:13-385861)
Oct 6 12:36:29 localhost abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N]
Oct 6 12:36:29 localhost abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N]
Oct 6 12:36:29 localhost abrt-server: Undefined variable outside of [[ ]] bracket
Oct 6 12:36:29 localhost abrt-server: Sending an email...
Oct 6 12:36:29 localhost abrt-server: Sending a notification email to: root@localhost
Oct 6 12:36:29 localhost abrt-server: Email was sent to: root@localhost
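Every kernel line in these logs reports the same fault address (6b00000064) and the same instruction pointer (000000000044747d), which suggests the crash is deterministic rather than random. A minimal sketch for pulling those fields out of /var/log/messages so that runs can be compared; `segfault_fields` is a hypothetical helper, and the pattern assumes the kernel message format shown in the logs.

```shell
#!/bin/sh
# Sketch: read log lines on stdin, print "pid addr ip" for each segfault line.
segfault_fields() {
    sed -n -E 's/.*\[([0-9]+)\]: segfault at ([0-9a-f]+) ip ([0-9a-f]+) .*/\1 \2 \3/p'
}
```

Usage: grep pdf2htmlEX /var/log/messages | segfault_fields
Lines that are not kernel segfault reports produce no output.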
Shared library directories (rebuild the linker cache):
ldconfig -v
cd /home/sam/anbank-web/test/convert_data
pdf2htmlEX --hdpi 144 --vdpi 144 20180912-T-1-1.pdf --dest-dir test.html
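IV. Parsing the HTML file
See the BeautifulSoup notes.
PS: In two days, using scrapy, BeautifulSoup and pdf2htmlEX for the first time, I extracted the product information of three companies. Pretty pleased with myself.
V. Discussion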
- pdf-to-word
https://smallpdf.com/cn/pdf-to-word
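I have a personal WeChat official account, but I am lazy and rarely update it. You can ask questions there; if I am slow to reply, email me: tiehan@sina.cn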